Abstract

A family of kernels for statistical learning is introduced that exploits the geometric structure
of statistical models. The kernels are based on the heat equation on the Riemannian
manifold  defined  by  the  Fisher  information  metric  associated  with  a  statistical  family,  and 
generalize  the  Gaussian  kernel  of  Euclidean  space.  As  an  important  special  case,  kernels 
based on the geometry of multinomial families are derived, leading to kernel-based learning
algorithms that apply naturally to discrete data. Bounds on covering numbers and
Rademacher  averages  for  the  kernels  are  proved  using  bounds  on  the  eigenvalues  of  the 
Laplacian on Riemannian manifolds. Experimental results are presented for document classification,
for which the use of multinomial geometry is natural and well motivated, and
improvements are obtained over Gaussian and linear kernels, which have
been the standard for text classification.

1 Introduction


The  use  of  Mercer  kernels  for  transforming  linear  classification  and  regression  schemes  into 
nonlinear  methods  is  a  fundamental  idea,  one  that  was  recognized  early  in  the  development 
of statistical learning algorithms such as the perceptron, splines, and support vector machines
(Aizerman et al., 1964, Kimeldorf and Wahba, 1971, Boser et al., 1992). The recent
resurgence  of  activity  on  kernel  methods  in  the  machine  learning  community  has  led  to  the 
further  development  of  this  important  technique,  demonstrating  how  kernels  can  be  key 
components  in  tools  for  tackling  nonlinear  data  analysis  problems,  as  well  as  for  integrating 
data  from  multiple  sources. 

Kernel  methods  can  typically  be  viewed  either  in  terms  of  an  implicit  representation  of  a 
high  dimensional  feature  space,  or  in  terms  of  regularization  theory  and  smoothing  (Poggio 
and  Girosi,  1990).  In  either  case,  most  standard  Mercer  kernels  such  as  the  Gaussian  or 
radial  basis  function  kernel  require  data  points  to  be  represented  as  vectors  in  Euclidean 
space. This initial processing of data as real-valued feature vectors, which is often carried
out in an ad hoc manner, has been called the “dirty laundry” of machine learning
(Dietterich, 2002): while the initial Euclidean feature representation is often crucial, there
is little theoretical guidance on how it should be obtained. For example, in text classification
a standard procedure for preparing the document collection for the application of
learning  algorithms  such  as  support  vector  machines  is  to  represent  each  document  as  a 
vector  of  scores,  with  each  dimension  corresponding  to  a  term,  possibly  after  scaling  by 
an  inverse  document  frequency  weighting  that  takes  into  account  the  distribution  of  terms 
in  the  collection  (Joachims,  2000).  While  such  a  representation  has  proven  to  be  effective, 
the  statistical  justification  of  such  a  transform  of  categorical  data  into  Euclidean  space  is 
unclear. 

Recent  work  by  Kondor  and  Lafferty  (2002)  was  directly  motivated  by  this  need  for 
kernel  methods  that  can  be  applied  to  discrete,  categorical  data,  in  particular  when  the 
data  lies  on  a  graph.  Kondor  and  Lafferty  (2002)  propose  the  use  of  discrete  diffusion 
kernels  and  tools  from  spectral  graph  theory  for  data  represented  by  graphs.  In  this  paper, 
we  propose  a  related  construction  of  kernels  based  on  the  heat  equation.  The  key  idea 
in  our  approach  is  to  begin  with  a  statistical  family  that  is  natural  for  the  data  being 
analyzed,  and  to  represent  data  as  points  on  the  statistical  manifold  associated  with  the 
Fisher  information  metric  of  this  family.  We  then  exploit  the  geometry  of  the  statistical 
family;  specifically,  we  consider  the  heat  equation  with  respect  to  the  Riemannian  structure 
given  by  the  Fisher  metric,  leading  to  a  Mercer  kernel  defined  on  the  appropriate  function 
spaces.  The  result  is  a  family  of  kernels  that  generalizes  the  familiar  Gaussian  kernel 
for  Euclidean  space,  and  that  includes  new  kernels  for  discrete  data  by  beginning  with 
statistical  families  such  as  the  multinomial.  Since  the  kernels  are  intimately  based  on 
the  geometry  of  the  Fisher  information  metric  and  the  heat  or  diffusion  equation  on  the 
associated  Riemannian  manifold,  we  refer  to  them  here  as  information  diffusion  kernels. 

One  apparent  limitation  of  the  discrete  diffusion  kernels  of  Kondor  and  Lafferty  (2002) 
is  the  difficulty  of  analyzing  the  associated  learning  algorithms  in  the  discrete  setting.  This 
stems from the fact that general bounds on the spectra of finite or even infinite graphs are
difficult to obtain, and research has concentrated on bounds on the first eigenvalues for
special  families  of  graphs.  In  contrast,  the  kernels  we  investigate  here  are  over  continuous 
parameter  spaces  even  in  the  case  where  the  underlying  data  is  discrete,  leading  to  more 
amenable  spectral  analysis.  We  can  draw  on  the  considerable  body  of  research  in  differential 
geometry  that  studies  the  eigenvalues  of  the  geometric  Laplacian,  and  thereby  apply  some 
of  the  machinery  that  has  been  developed  for  analyzing  the  generalization  performance  of 
kernel  machines  in  our  setting. 

Although the framework proposed is fairly general, in this paper we focus on the application
of these ideas to text classification, where the natural statistical family is the
multinomial. In the simplest case, the words in a document are modeled as independent
draws from a fixed multinomial; non-independent draws, corresponding to n-grams or more
complicated mixture models, are also possible. For n-gram models, the maximum likelihood
multinomial  model  is  obtained  simply  as  normalized  counts,  and  smoothed  estimates  can 
be  used  to  remove  the  zeros.  This  mapping  is  then  used  as  an  embedding  of  each  document 
into  the  statistical  family,  where  the  geometric  framework  applies.  We  remark  that  the 
perspective  of  associating  multinomial  models  with  individual  documents  has  recently  been 
explored  in  information  retrieval,  with  promising  results  (Ponte  and  Croft,  1998,  Zhai  and 
Lafferty,  2001). 
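
The embedding described above can be sketched as follows for the unigram case. This is a minimal illustration, not the procedure used in the experiments: the function name `embed_document`, the toy vocabulary, and the additive smoothing constant `alpha` are all assumptions introduced here. Setting `alpha=0` gives the maximum likelihood estimate (normalized counts); any `alpha > 0` removes the zeros so that the point lies in the open simplex.

```python
from collections import Counter

import numpy as np


def embed_document(tokens, vocabulary, alpha=1.0):
    """Map a tokenized document to a point on the multinomial simplex.

    alpha is a hypothetical additive-smoothing constant (an assumption of
    this sketch); alpha=0 reduces to normalized counts, i.e. the MLE.
    """
    counts = Counter(tokens)
    c = np.array([counts.get(w, 0) for w in vocabulary], dtype=float)
    smoothed = c + alpha
    return smoothed / smoothed.sum()


# Toy example: a three-word document over a four-word vocabulary.
vocab = ["heat", "kernel", "manifold", "text"]
theta = embed_document(["heat", "kernel", "kernel"], vocab, alpha=0.5)
```

The resulting vector `theta` has strictly positive entries that sum to one, so it can be treated as a point of the statistical manifold to which the geometric framework applies.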

The statistical manifold of the n-dimensional multinomial family comes from an embedding
of the multinomial simplex into the n-dimensional sphere which is isometric under the
Fisher information metric. Thus, the multinomial family can be viewed as a manifold
of  constant  positive  curvature.  As  discussed  below,  there  are  mathematical  technicalities 
due  to  corners  and  edges  on  the  boundary  of  the  multinomial  simplex,  but  intuitively,  the 
multinomial  family  can  be  viewed  in  this  way  as  a  Riemannian  manifold  with  boundary;  we 
address  the  technicalities  by  a  “rounding”  procedure  on  the  simplex.  While  the  heat  kernel 
for  this  manifold  does  not  have  a  closed  form,  we  can  approximate  the  kernel  in  a  closed 
form using the leading term in the parametrix expansion, a small time asymptotic expansion
for the heat kernel that is of great use in differential geometry. This results in a kernel
that  can  be  readily  applied  to  text  documents,  and  that  is  well  motivated  mathematically 
and  statistically. 
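
As a rough sketch of what such a closed-form approximation might look like in code, the following assumes the standard isometry θ ↦ 2√θ of the simplex onto a portion of the sphere of radius 2, under which the Fisher geodesic distance between two multinomials is 2 arccos(Σ_i √(θ_i θ'_i)). That distance formula is not stated in this section and is used here as an assumption; the function names are hypothetical.

```python
import numpy as np


def fisher_geodesic_distance(theta, theta_prime):
    """Geodesic distance on the simplex under the Fisher metric.

    Assumes the isometry theta -> 2*sqrt(theta) onto a sphere of radius 2,
    so the distance is twice the angle between the square-root vectors.
    """
    inner = np.sum(np.sqrt(theta * theta_prime))
    # Clip to guard against rounding slightly outside [0, 1].
    return 2.0 * np.arccos(np.clip(inner, 0.0, 1.0))


def leading_order_kernel(theta, theta_prime, t):
    """Leading parametrix term: Euclidean heat kernel with geodesic distance."""
    n = theta.shape[0] - 1  # dimension of the simplex
    d = fisher_geodesic_distance(theta, theta_prime)
    return (4.0 * np.pi * t) ** (-n / 2.0) * np.exp(-d ** 2 / (4.0 * t))


a = np.array([0.3, 0.5, 0.1, 0.1])
b = np.array([0.25, 0.25, 0.25, 0.25])
k_ab = leading_order_kernel(a, b, t=0.1)
```

Only the leading term is used here; higher-order correction terms of the parametrix are omitted in this sketch.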

We  present  detailed  experiments  for  text  classification,  using  both  the  WebKB  and 
Reuters  data  sets,  which  have  become  standard  test  collections.  Our  experimental  results 
indicate  that  the  multinomial  information  diffusion  kernel  performs  very  well  empirically. 
This  improvement  can  in  part  be  attributed  to  the  role  of  the  Fisher  information  metric, 
which results in points near the boundary of the simplex being given relatively more importance
than in the flat Euclidean metric. Viewed differently, effects similar to those obtained
by  heuristically  designed  term  weighting  schemes  such  as  inverse  document  frequency  are 
seen  to  arise  automatically  from  the  geometry  of  the  statistical  manifold. 

The remaining sections are organized as follows. In Section 2 we review the relevant concepts
that are required from Riemannian geometry and define the heat kernel for a general
Riemannian  manifold,  together  with  its  parametrix  expansion.  In  Section  3  we  define  the 
Fisher  metric  associated  with  a  statistical  manifold  of  distributions,  and  examine  in  some 
detail the special cases of the multinomial and spherical normal families; the proposed use
of the heat kernel or its parametrix approximation on the statistical manifold is the main
contribution  of  the  paper.  Section  4  derives  bounds  on  covering  numbers  and  Rademacher 
averages  for  various  learning  algorithms  that  use  the  new  kernels,  borrowing  results  from 
differential  geometry  on  bounds  for  the  geometric  Laplacian.  Section  5  describes  the  results 
of  applying  the  multinomial  diffusion  kernels  to  text  classification,  and  we  conclude  with  a 
discussion  of  our  results  in  Section  6. 

2  Riemannian  Geometry  and  the  Heat  Kernel 

We  begin  by  briefly  reviewing  some  of  the  elementary  concepts  from  Riemannian  geometry 
that  will  be  used  in  the  construction  of  information  diffusion  kernels,  since  these  concepts 
are  not  widely  used  in  machine  learning.  We  refer  to  Spivak  (1979)  for  details  and  further 
background, or Milnor (1963) for an elegant and concise overview; however, most introductory
texts on differential geometry include this material. The basic properties of the heat
kernel on a Riemannian manifold are then presented in Section 2.3. An excellent introductory
account of this topic is given by Rosenberg (1997), and an authoritative reference
for  spectral  methods  in  Riemannian  geometry  is  Schoen  and  Yau  (1994).  Readers  whose 
differential  geometry  is  in  good  repair  may  wish  to  proceed  directly  to  Section  2.3.1  or  to 
Section  3. 

2.1  Basic  Definitions 

An n-dimensional differentiable manifold $M$ is a set of points that is locally equivalent to
$\mathbb{R}^n$ by smooth transformations, supporting operations such as differentiation. Formally, a
differentiable manifold is a set $M$ together with a collection of local charts $\{(U_i, \varphi_i)\}$, where
$U_i \subset M$ with $\bigcup_i U_i = M$, and $\varphi_i : U_i \subset M \to \mathbb{R}^n$ is a bijection. For each pair of local
charts $(U_i, \varphi_i)$ and $(U_j, \varphi_j)$, it is required that $\varphi_j(U_i \cap U_j)$ is open and
$\varphi_{ij} = \varphi_i \circ \varphi_j^{-1}$ is a diffeomorphism.

The tangent space $T_pM \cong \mathbb{R}^n$ at $p \in M$ can be thought of as directional derivatives
operating on the set of real valued differentiable functions $f : M \to \mathbb{R}$. Equivalently,
the tangent space $T_pM$ can be viewed in terms of an equivalence class of curves on
$M$ passing through $p$. Two curves $c_1 : (-\epsilon, \epsilon) \to M$ and $c_2 : (-\epsilon, \epsilon) \to M$ are equivalent
at $p$ in case $c_1(0) = c_2(0) = p$ and $\varphi \circ c_1$ and $\varphi \circ c_2$ are tangent at $p$ for some local chart $\varphi$
(and therefore all charts), in the sense that their derivatives at 0 exist and are equal.

In many cases of interest, the manifold $M$ is a submanifold of a larger manifold, often
$\mathbb{R}^m$, $m \geq n$. For example, the open n-dimensional simplex, defined by

$$\mathbb{P}_n = \left\{ \theta \in \mathbb{R}^{n+1} : \sum_{i=1}^{n+1} \theta_i = 1,\ \theta_i > 0 \right\} \tag{1}$$

is a submanifold of $\mathbb{R}^{n+1}$. In such a case, the tangent space of the submanifold $T_pM$ is
a subspace of $T_p\mathbb{R}^m$, and we may represent the tangent vectors $v \in T_pM$ in terms of the
standard basis of the tangent space $T_p\mathbb{R}^m \cong \mathbb{R}^m$, $v = \sum_{i=1}^m v_i e_i$. The open n-simplex is a
differential manifold with a single, global chart.

A manifold with boundary is defined similarly, except that the local charts $(U, \varphi)$ satisfy
$\varphi(U) \subset \mathbb{R}^n_+$, thus mapping a patch of $M$ to the half-space $\mathbb{R}^n_+ = \{x \in \mathbb{R}^n \mid x_n \geq 0\}$. In
general, if $U$ and $V$ are open sets in $\mathbb{R}^n_+$ in the topology induced from $\mathbb{R}^n$, and $f : U \to V$
is a diffeomorphism, then $f$ induces diffeomorphisms $\mathrm{Int} f : \mathrm{Int}\, U \to \mathrm{Int}\, V$ and $\partial f : \partial U \to \partial V$,
where $\partial A = A \cap (\mathbb{R}^{n-1} \times \{0\})$ and $\mathrm{Int}\, A = A \cap \{x \in \mathbb{R}^n \mid x_n > 0\}$. Thus, it makes sense
to define the interior $\mathrm{Int}\, M = \bigcup_U \varphi^{-1}(\mathrm{Int}(\varphi(U)))$ and boundary $\partial M = \bigcup_U \varphi^{-1}(\partial(\varphi(U)))$
of $M$. Since $\mathrm{Int}\, M$ is open it is an n-dimensional manifold without boundary, and $\partial M$ is an
(n − 1)-dimensional manifold without boundary.

If $f : M \to N$ is a diffeomorphism of the manifold $M$ onto the manifold $N$, then $f$
induces a push-forward mapping $f_*$ of the associated tangent spaces. A vector field $X \in TM$
is mapped to the push-forward $f_*X \in TN$, satisfying $(f_*X)(g) = X(g \circ f)$ for all
$g \in C^\infty(N)$. Intuitively, the push-forward mapping transforms velocity vectors of curves to
velocity vectors of the corresponding curves in the new manifold. Such a mapping is of use
in transforming metrics, as described next.

2.2  The  Geometric  Laplacian 

The construction of our kernels is based on the geometric Laplacian.¹ In order to define
the generalization of the familiar Laplacian $\Delta = \frac{\partial^2}{\partial x_1^2} + \cdots + \frac{\partial^2}{\partial x_n^2}$ on $\mathbb{R}^n$ to manifolds,
one needs a notion of geometry, in particular a way of measuring lengths of tangent vectors.
A Riemannian manifold $(M, g)$ is a differentiable manifold $M$ with a family of smoothly
varying positive-definite inner products $g = g_p$ on $T_pM$ for each $p \in M$. Two Riemannian
manifolds $(M, g)$ and $(N, h)$ are isometric in case there is a diffeomorphism $f : M \to N$
such that

$$g_p(X, Y) = h_{f(p)}(f_*X, f_*Y) \tag{2}$$

for every $X, Y \in T_pM$ and $p \in M$. Occasionally, hard computations on one manifold can
be transformed to easier computations on an isometric manifold. Every manifold can be
given a Riemannian metric. For example, every manifold can be embedded in $\mathbb{R}^m$ for some
$m \geq n$ (the Whitney embedding theorem), and the Euclidean metric induces a metric on
the manifold under the embedding. In fact, every Riemannian metric can be obtained in
this way (the Nash embedding theorem).

In local coordinates, $g$ can be represented as $g_p(v, w) = \sum_{i,j} g_{ij}(p)\, v_i w_j$ where $g(p) = [g_{ij}(p)]$
is a non-singular, symmetric and positive-definite matrix depending smoothly on $p$,
and tangent vectors $v$ and $w$ are represented in local coordinates at $p$ as $v = \sum_{i=1}^n v_i\, \partial_i|_p$ and
$w = \sum_{i=1}^n w_i\, \partial_i|_p$. As an example, consider the open n-dimensional simplex defined in (1). A
metric on $\mathbb{R}^{n+1}$ expressed by the symmetric positive-definite matrix $G = [g_{ij}] \in \mathbb{R}^{(n+1)\times(n+1)}$
induces a metric on $\mathbb{P}_n$ as

$$g_p(v, u) = \left\langle \sum_{i=1}^{n+1} u_i e_i,\ \sum_{i=1}^{n+1} v_i e_i \right\rangle = \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} g_{ij}\, u_i v_j \tag{3}$$

¹ As described by Nelson (1968), “The Laplace operator in its various manifestations is the most beautiful
and central object in all of mathematics. Probability theory, mathematical physics, Fourier analysis, partial
differential equations, the theory of Lie groups, and differential geometry all revolve around this sun, and
its light even penetrates such obscure regions as number theory and algebraic geometry.”

The metric enables the definition of lengths of vectors and curves, and therefore distance
between points on the manifold. The length of a tangent vector at $p \in M$ is given by
$\|v\| = \sqrt{\langle v, v \rangle_p}$, $v \in T_pM$, and the length of a curve $c : [a, b] \to M$ is then given by
$L(c) = \int_a^b \|\dot{c}(t)\|\, dt$ where $\dot{c}(t)$ is the velocity vector of the path $c$ at time $t$. Using the
above definition of lengths of curves, we can define the distance $d(x, y)$ between two points
$x, y \in M$ as the length of the shortest piecewise differentiable curve connecting $x$ and $y$. This
geodesic distance $d$ turns the Riemannian manifold into a metric space, satisfying the usual
properties of positivity, symmetry and the triangle inequality. Riemannian manifolds also
support convex neighborhoods. In particular, if $p \in M$, there is an open set $U$ containing
$p$ such that any two points of $U$ can be connected by a unique minimal geodesic in $U$.

A manifold is said to be geodesically complete in case every geodesic curve $c(t)$, $t \in [a, b]$,
can be extended to be defined for all $t \in \mathbb{R}$. It can be shown (Milnor, 1963) that the
following are equivalent: (1) $M$ is geodesically complete, (2) $d$ is a complete metric on $M$,
and (3) closed and bounded subsets of $M$ are compact. In particular, compact manifolds
are geodesically complete. The Hopf-Rinow theorem (Milnor, 1963) asserts that if $M$ is
complete, then any two points can be joined by a minimal geodesic. This minimal geodesic is
not necessarily unique, as seen by considering antipodal points on a sphere. The exponential
map $\exp_x$ maps a neighborhood $V$ of $0 \in T_xM$ diffeomorphically onto a neighborhood of
$x \in M$. By definition, $\exp_x v$ is the point $\gamma_v(1)$ where $\gamma_v$ is a geodesic starting at $x$ with
initial velocity $v = \frac{d\gamma_v}{dt}\big|_{t=0}$. Any such geodesic satisfies $\gamma_{rv}(s) = \gamma_v(rs)$ for $r > 0$. This
mapping defines a local coordinate system on $M$ called normal coordinates, under which
many computations are especially convenient.

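For a function $f : M \to \mathbb{R}$, the gradient $\operatorname{grad} f$ is the vector field defined by

$$\langle \operatorname{grad} f(p), X \rangle = X(f) \tag{4}$$

In local coordinates, the gradient is given by

$$(\operatorname{grad} f)_i = \sum_j g^{ij}\, \frac{\partial f}{\partial x_j} \tag{5}$$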

where $[g^{ij}(p)]$ is the inverse of $[g_{ij}(p)]$. The divergence operator is defined to be the adjoint
of the gradient, allowing “integration by parts” on manifolds with special structure. An
orientation of a manifold is a smooth choice of orientation for the tangent spaces, meaning
that for local charts $\varphi_i$ and $\varphi_j$, the differential $D(\varphi_j \circ \varphi_i^{-1})(x) : \mathbb{R}^n \to \mathbb{R}^n$ is orientation
preserving, so the sign of the determinant is constant. If a Riemannian manifold $M$ is
orientable, it is possible to define a volume form $\mu$, where if $v_1, v_2, \ldots, v_n \in T_pM$ (positively
oriented), then

$$\mu(v_1, \ldots, v_n) = \sqrt{\det \langle v_i, v_j \rangle} > 0 \tag{6}$$

A volume form, in turn, enables the definition of the divergence of a vector field on the
manifold. In local coordinates, the divergence is given by

$$\operatorname{div} X = \frac{1}{\sqrt{\det g}} \sum_i \frac{\partial}{\partial x_i} \left( \sqrt{\det g}\; X_i \right) \tag{7}$$

Finally, the Laplace-Beltrami operator on functions is defined by

$$\Delta = \operatorname{div} \circ \operatorname{grad} \tag{8}$$

which in local coordinates is thus given by

$$\Delta f = \frac{1}{\sqrt{\det g}} \sum_i \frac{\partial}{\partial x_i} \left( \sqrt{\det g}\, \sum_j g^{ij}\, \frac{\partial f}{\partial x_j} \right) \tag{9}$$


These  definitions  preserve  the  familiar  intuitive  interpretation  of  the  usual  operators  in 
Euclidean  geometry;  in  particular,  the  gradient  points  in  the  direction  of  steepest  ascent 
and  the  divergence  measures  outflow  minus  inflow  of  liquid  or  heat. 
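
As a worked check of the local-coordinate formula for the Laplace-Beltrami operator, consider the punctured plane in polar coordinates $(r, \theta)$ with the Euclidean metric. This example is an illustration added here, not part of the original development; it simply confirms that the general expression reduces to the familiar polar Laplacian.

```latex
g = \begin{pmatrix} 1 & 0 \\ 0 & r^2 \end{pmatrix},
\qquad
\sqrt{\det g} = r,
\qquad
[g^{ij}] = \begin{pmatrix} 1 & 0 \\ 0 & r^{-2} \end{pmatrix}

\Delta f
  = \frac{1}{r} \left[
      \frac{\partial}{\partial r}\left( r \cdot 1 \cdot \frac{\partial f}{\partial r} \right)
    + \frac{\partial}{\partial \theta}\left( r \cdot r^{-2} \cdot \frac{\partial f}{\partial \theta} \right)
    \right]
  = \frac{\partial^2 f}{\partial r^2}
    + \frac{1}{r} \frac{\partial f}{\partial r}
    + \frac{1}{r^2} \frac{\partial^2 f}{\partial \theta^2}
```

Because polar coordinates are an isometric reparametrization of flat space, the result agrees with the standard Laplacian $\partial^2/\partial x_1^2 + \partial^2/\partial x_2^2$ expressed in $(r, \theta)$.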


2.3  The  Heat  Kernel 

The Laplacian is used to model how heat will diffuse throughout a geometric manifold; the
flow is governed by the following second order differential equation with initial conditions

$$\frac{\partial f}{\partial t} - \Delta f = 0 \tag{10}$$

$$f(x, 0) = f(x) \tag{11}$$

The value $f(x, t)$ describes the heat at location $x$ at time $t$, beginning from an initial
distribution of heat given by $f(x)$ at time zero. The heat or diffusion kernel $K_t(x, y)$ is the
solution $f(x, t)$ to the heat equation with initial condition given by Dirac’s delta function $\delta_y$.
As a consequence of the linearity of the heat equation, the heat kernel can be used to
generate the solution to the heat equation with arbitrary initial conditions, according to

$$f(x, t) = \int_M K_t(x, y)\, f(y)\, dy \tag{12}$$

As a simple special case, consider heat flow on the circle, or one-dimensional sphere
$M = S^1$. Parameterizing the manifold by angle $\theta$, and letting $f(\theta, t) = \sum_{j=0}^{\infty} a_j(t) \cos(j\theta)$
be the Fourier cosine series of the solution to the heat equation, with initial conditions
given by $a_j(0) = a_j$, it is seen that the heat equation leads to the equation

$$\sum_{j=0}^{\infty} \left( \frac{d}{dt} a_j(t) + j^2 a_j(t) \right) \cos(j\theta) = 0 \tag{13}$$

which is easily solved to obtain $a_j(t) = a_j e^{-j^2 t}$ and therefore $f(\theta, t) = \sum_{j=0}^{\infty} a_j e^{-j^2 t} \cos(j\theta)$.
As the time parameter $t$ gets large, the solution converges to $f(\theta, t) \to a_0$, which is the
average value of $f$; thus, the heat diffuses until the manifold is at a uniform temperature.
To express the solution in terms of an integral kernel, note that by the Fourier inversion
formula

$$f(\theta, t) = \sum_{j=0}^{\infty} \langle f, e^{ij\theta} \rangle\, e^{-j^2 t}\, e^{ij\theta} \tag{14}$$

$$\phantom{f(\theta, t)} = \frac{1}{2\pi} \int_{S^1} \sum_{j=0}^{\infty} e^{-j^2 t}\, e^{ij\theta}\, e^{-ij\phi}\, f(\phi)\, d\phi \tag{15}$$

thus expressing the solution as $f(\theta, t) = \int_{S^1} K_t(\theta, \phi)\, f(\phi)\, d\phi$ for the heat kernel

$$K_t(\phi, \theta) = \frac{1}{2\pi} \sum_{j=0}^{\infty} e^{-j^2 t} \cos(j(\theta - \phi)) \tag{16}$$

This simple example shows several properties of the general solution of the heat equation
on a (compact) Riemannian manifold; in particular, note that the eigenvalues of the kernel
scale as $\lambda_j \sim e^{-j^{2/d}}$ where the dimension in this case is $d = 1$.
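
The decay of the cosine coefficients on the circle can be checked numerically. The sketch below is illustrative only: the coefficient vector `a`, the grid size, and the function name `heat_solution_on_circle` are assumptions chosen for the example. It evaluates the truncated series with coefficients $a_j e^{-j^2 t}$, compares a finite-difference time derivative against a finite-difference second derivative in $\theta$, and confirms that the solution flattens to $a_0$ for large $t$.

```python
import numpy as np


def heat_solution_on_circle(a, theta, t):
    """Evaluate f(theta, t) = sum_j a_j * exp(-j^2 t) * cos(j * theta)."""
    j = np.arange(len(a))
    coeffs = a * np.exp(-(j ** 2) * t)
    return coeffs @ np.cos(np.outer(j, theta))


# Hypothetical initial cosine coefficients a_0, ..., a_3.
a = np.array([0.5, 1.0, -0.3, 0.2])
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
dtheta = theta[1] - theta[0]
t, h = 0.1, 1e-4

# Central difference in time.
dfdt = (heat_solution_on_circle(a, theta, t + h)
        - heat_solution_on_circle(a, theta, t - h)) / (2.0 * h)

# Periodic central second difference in the angle.
f = heat_solution_on_circle(a, theta, t)
d2f = (np.roll(f, -1) - 2.0 * f + np.roll(f, 1)) / dtheta ** 2

# Largest pointwise mismatch between df/dt and d^2 f / dtheta^2.
residual = np.max(np.abs(dfdt - d2f))

# Long-time behaviour: every mode except j = 0 has decayed.
f_late = heat_solution_on_circle(a, theta, 20.0)
```

The residual is limited by the finite-difference discretization rather than by the series itself, so it shrinks as the grid is refined.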

When $M = \mathbb{R}$, the heat kernel is the familiar Gaussian kernel, so that the solution to
the heat equation is expressed as

$$f(x, t) = \frac{1}{\sqrt{4\pi t}} \int_{\mathbb{R}} e^{-\frac{(x - y)^2}{4t}}\, f(y)\, dy \tag{17}$$

and it is seen that as $t \to \infty$, the heat diffuses out “to infinity” so that $f(x, t) \to 0$.

When $M$ is compact, the Laplacian has discrete eigenvalues $0 = \mu_0 < \mu_1 \leq \mu_2 \cdots$ with
corresponding eigenfunctions $\phi_i$ satisfying $\Delta \phi_i = -\mu_i \phi_i$. When the manifold has a boundary,
appropriate boundary conditions must be imposed in order for $\Delta$ to be self-adjoint.
Dirichlet boundary conditions set $\phi_i|_{\partial M} = 0$ and Neumann boundary conditions require
$\frac{\partial \phi_i}{\partial \nu}\big|_{\partial M} = 0$ where $\nu$ is the outer normal direction. The following theorem summarizes the
basic properties for the kernel of the heat equation on $M$; we refer to Schoen and Yau (1994)
for a proof.

Theorem 1 Let $M$ be a complete Riemannian manifold. Then there exists a function
$K \in C^\infty(\mathbb{R}_+ \times M \times M)$, called the heat kernel, which satisfies the following properties for
all $x, y \in M$, with $K_t(\cdot, \cdot) = K(t, \cdot, \cdot)$

1. $K_t(x, y) = K_t(y, x)$

2. $\lim_{t \to 0} K_t(x, y) = \delta_x(y)$

3. $\left( \Delta - \frac{\partial}{\partial t} \right) K_t(x, y) = 0$

4. $K_t(x, y) = \int_M K_{t-s}(x, z)\, K_s(z, y)\, dz$ for any $s > 0$

If in addition $M$ is compact, then $K_t$ can be expressed in terms of the eigenvalues and
eigenfunctions of the Laplacian as $K_t(x, y) = \sum_{i=0}^{\infty} e^{-\mu_i t}\, \phi_i(x)\, \phi_i(y)$.

Properties 2 and 3 imply that $K_t(x, y)$ solves the heat equation in $x$, starting from a
point heat source at $y$. It follows that $e^{t\Delta} f(x) = f(x, t) = \int_M K_t(x, y)\, f(y)\, dy$ solves the
heat equation with initial conditions $f(x, 0) = f(x)$, since

$$\frac{\partial f(x, t)}{\partial t} = \int_M \frac{\partial K_t(x, y)}{\partial t}\, f(y)\, dy \tag{18}$$

$$\phantom{\frac{\partial f(x, t)}{\partial t}} = \int_M \Delta K_t(x, y)\, f(y)\, dy \tag{19}$$

$$\phantom{\frac{\partial f(x, t)}{\partial t}} = \Delta \int_M K_t(x, y)\, f(y)\, dy \tag{20}$$

$$\phantom{\frac{\partial f(x, t)}{\partial t}} = \Delta f(x, t) \tag{21}$$

and $\lim_{t \to 0} f(x, t) = \int_M \lim_{t \to 0} K_t(x, y)\, f(y)\, dy = f(x)$. Property 4 implies that $e^{t\Delta} e^{s\Delta} = e^{(t+s)\Delta}$,
which has the physically intuitive interpretation that heat diffusion for time $t$ is the
composition of heat diffusion up to time $s$ with heat diffusion for an additional time $t - s$.
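Since $e^{t\Delta}$ is a positive operator,

$$\int_M \int_M K_t(x, y)\, f(x)\, f(y)\, dx\, dy = \int_M f(x)\, e^{t\Delta} f(x)\, dx \tag{22}$$

$$\phantom{\int_M \int_M K_t(x, y)\, f(x)\, f(y)\, dx\, dy} = \langle f, e^{t\Delta} f \rangle \geq 0 \tag{23}$$

Thus $K_t(x, y)$ is positive-definite. In the compact case, positive-definiteness follows directly
from the expansion $K_t(x, y) = \sum_{i=0}^{\infty} e^{-\mu_i t}\, \phi_i(x)\, \phi_i(y)$, which shows that the eigenvalues of
$K_t$ as an integral operator are $e^{-\mu_i t}$. Together, these properties show that $K_t$ defines a
Mercer kernel.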
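
Positive-definiteness can be observed numerically on a finite sample. The sketch below uses the flat Euclidean heat kernel, for which a closed form is available, and checks that its Gram matrix on random points is symmetric with no negative eigenvalues beyond rounding error. The sample size, dimension, diffusion time, and function name are assumptions made for illustration.

```python
import numpy as np


def euclidean_heat_kernel(X, Y, t):
    """Gram matrix of (4 pi t)^(-n/2) exp(-||x - y||^2 / (4 t))."""
    n = X.shape[1]
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    sq = np.maximum(sq, 0.0)  # remove tiny negative values from rounding
    return (4.0 * np.pi * t) ** (-n / 2.0) * np.exp(-sq / (4.0 * t))


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 hypothetical points in R^3
K = euclidean_heat_kernel(X, X, t=0.5)
min_eig = np.linalg.eigvalsh(K).min()  # smallest eigenvalue of the Gram matrix
```

A finite Gram matrix with a nonnegative spectrum is the discrete counterpart of the integral inequality above; it is a sanity check, not a proof.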

The heat kernel $K_t(x, y)$ is a natural candidate for measuring the similarity between
points $x, y \in M$, while respecting the geometry encoded in the metric $g$. Furthermore,
it is, unlike the geodesic distance, a Mercer kernel, a fact that enables its use in statistical
kernel machines. When this kernel is used for classification, as in our text classification
experiments presented in Section 5, the discriminant function $y_t(x) = \sum_i \alpha_i y_i K_t(x, x_i)$ can
be interpreted as the solution to the heat equation with initial temperature $y_0(x_i) = \alpha_i y_i$
on labeled data points $x_i$, and initial temperature $y_0(x) = 0$ elsewhere.
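
The heat-flow reading of the discriminant function can be sketched directly. In the example below the weights `alphas` are placeholders standing in for coefficients that a kernel machine would learn, and the Euclidean heat kernel is substituted for a general manifold kernel; both choices, and the function names, are assumptions of this sketch.

```python
import numpy as np


def gaussian_heat_kernel(x, xi, t):
    """Euclidean heat kernel between two points of R^n."""
    n = x.shape[0]
    return (4.0 * np.pi * t) ** (-n / 2.0) * np.exp(-np.sum((x - xi) ** 2) / (4.0 * t))


def discriminant(x, points, alphas, labels, t, kernel=gaussian_heat_kernel):
    """y_t(x) = sum_i alpha_i * y_i * K_t(x, x_i)."""
    return sum(a * y * kernel(x, xi, t) for a, y, xi in zip(alphas, labels, points))


# Two labeled points: a heat source (+1) and a heat sink (-1).
points = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
labels = [+1, -1]
alphas = [1.0, 1.0]  # hypothetical weights, not learned here

near_positive = discriminant(np.array([0.9, 0.0]), points, alphas, labels, t=0.25)
near_negative = discriminant(np.array([-0.9, 0.0]), points, alphas, labels, t=0.25)
midpoint = discriminant(np.array([0.0, 0.0]), points, alphas, labels, t=0.25)
```

The sign of the diffused temperature gives the predicted class, and the midpoint between the two sources sits on the decision boundary by symmetry.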


2.3.1  The  parametrix  expansion 

For  most  geometries,  there  is  no  closed  form  solution  for  the  heat  kernel.  However,  the 
short  time  behavior  of  the  solutions  can  be  studied  using  an  asymptotic  expansion  called 
the  parametrix  expansion.  In  fact,  the  existence  of  the  heat  kernel,  as  asserted  in  the  above 
theorem,  is  most  directly  proven  by  first  showing  the  existence  of  the  parametrix  expansion. 
Although  it  is  local,  the  parametrix  expansion  contains  a  wealth  of  geometric  information, 
and  indeed  much  of  modern  differential  geometry,  notably  index  theory,  is  based  upon  this 
expansion  and  its  generalizations.  In  Section  5  we  will  employ  the  first-order  parametrix 
expansion  for  text  classification. 

Recall that the heat kernel on flat n-dimensional Euclidean space is given by

$$K_t^{\mathrm{Euclid}}(x, y) = (4\pi t)^{-\frac{n}{2}} \exp\left( -\frac{\|x - y\|^2}{4t} \right) \tag{24}$$

where $\|x - y\|^2 = \sum_{i=1}^n |x_i - y_i|^2$ is the squared Euclidean distance between $x$ and $y$. The
parametrix expansion approximates the heat kernel locally as a correction to this Euclidean
heat kernel. To begin the definition of the parametrix, let

$$P_t^{(m)}(x, y) = (4\pi t)^{-\frac{n}{2}} \exp\left( -\frac{d^2(x, y)}{4t} \right) \left( \psi_0(x, y) + \psi_1(x, y)\, t + \cdots + \psi_m(x, y)\, t^m \right) \tag{25}$$

for currently unspecified functions $\psi_k(x, y)$, but where $d^2(x, y)$ now denotes the square of
the geodesic distance on the manifold. The idea is to obtain $\psi_k$ recursively by solving the
heat equation approximately to order $t^m$, for small diffusion time $t$.

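Let $r = d(x, y)$ denote the length of the radial geodesic from $x$ to $y \in V_x$ in the normal
coordinates defined by the exponential map. For any functions $f(r)$ and $h(r)$ of $r$, it can
be shown that

$$\Delta f = \frac{d^2 f}{dr^2} + \frac{d \left( \log \sqrt{\det g} \right)}{dr}\, \frac{df}{dr} \tag{26}$$

$$\Delta (fh) = f \Delta h + h \Delta f + 2\, \frac{df}{dr}\, \frac{dh}{dr} \tag{27}$$

Starting from these basic relations, some calculus shows that

$$\left( \Delta - \frac{\partial}{\partial t} \right) P_t^{(m)} = \left( t^m \Delta \psi_m \right) (4\pi t)^{-\frac{n}{2}} \exp\left( -\frac{r^2}{4t} \right) \tag{28}$$

when $\psi_k$ are defined recursively as

$$\psi_0 = \left( \frac{\sqrt{\det g}}{r^{n-1}} \right)^{-\frac{1}{2}} \tag{29}$$

$$\psi_k = r^{-k}\, \psi_0 \int_0^r \psi_0^{-1} \left( \Delta \psi_{k-1} \right) s^{k-1}\, ds \quad \text{for } k > 0 \tag{30}$$

With this recursive definition of the functions $\psi_k$, the expansion (25), which is defined only
locally, is then extended to all of $M \times M$ by smoothing with a “cut-off function” $\eta$, with
the specification that $\eta : \mathbb{R}_+ \to [0, 1]$ is $C^\infty$ and

$$\eta(r) = \begin{cases} 0 & r \geq 1 \\ 1 & r \leq c \end{cases} \tag{31}$$

for some constant $0 < c < 1$. Thus, the order-m parametrix is defined as

$$K_t^{(m)}(x, y) = \eta(d(x, y))\, P_t^{(m)}(x, y) \tag{32}$$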

As suggested by equation (28), $K_t^{(m)}$ is an approximate solution to the heat equation,
and satisfies $K_t(x, y) = K_t^{(m)}(x, y) + O(t^m)$ for $x$ and $y$ sufficiently close; in particular, the
parametrix is not unique. For further details we refer to (Schoen and Yau, 1994, Rosenberg,
1997).

While the parametrix $K_t^{(m)}$ is not in general positive-definite, and therefore does not
define a Mercer kernel, it is positive-definite for $t$ sufficiently small. In particular, define
$f(t) = \min \operatorname{spec}\left( K_t^{(m)} \right)$, where $\min \operatorname{spec}$ denotes the smallest eigenvalue. Then $f$ is a continuous
function with $f(0) = 1$ since $K_0^{(m)} = I$. Thus, there is some time interval $[0, \epsilon)$ for
which $K_t^{(m)}$ is positive-definite in case $t \in [0, \epsilon)$. This fact will be used when we employ the
parametrix approximation to the heat kernel for statistical learning.
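
The behaviour of the smallest eigenvalue for small $t$ can be explored numerically. The sketch below is an assumption-laden illustration: it takes random points on the unit sphere $S^2$, uses only the exponential factor $\exp(-d^2/4t)$ of the leading parametrix term with $d$ the great-circle distance, and drops the positive prefactor $(4\pi t)^{-n/2}$, the function $\psi_0$, and the cut-off, so that the Gram matrix tends to the identity as $t \to 0$. The function names and sample size are hypothetical.

```python
import numpy as np


def sphere_geodesic_distances(X):
    """Pairwise great-circle distances between unit vectors (rows of X)."""
    G = np.clip(X @ X.T, -1.0, 1.0)
    D = np.arccos(G)
    np.fill_diagonal(D, 0.0)  # exact zeros on the diagonal
    return D


def min_spec(D, t):
    """Smallest eigenvalue of the Gram matrix exp(-d^2 / (4 t))."""
    K = np.exp(-D ** 2 / (4.0 * t))
    return np.linalg.eigvalsh(K).min()


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # project onto the unit sphere
D = sphere_geodesic_distances(X)

# Smallest eigenvalue over a range of diffusion times.
spectrum_floor = {t: min_spec(D, t) for t in (1e-7, 1e-3, 1e-2, 1e-1, 1.0)}
```

Inspecting `spectrum_floor` shows how far the smallest eigenvalue moves away from 1 as $t$ grows on a given sample; the sign for larger $t$ depends on the sample and is not asserted here.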

3  Diffusion  Kernels  on  Statistical  Manifolds 

We  now  proceed  to  the  main  contribution  of  the  paper,  which  is  the  application  of  the  heat 
kernel  constructions  reviewed  in  the  previous  section  to  the  geometry  of  statistical  families, 
in  order  to  obtain  kernels  for  statistical  learning. 

Under  some  mild  regularity  conditions,  general  parametric  statistical  families  come 
equipped  with  a  canonical  geometry  based  on  the  Fisher  information  metric.  This  geometry 
has  long  been  recognized  (Rao,  1945),  and  there  is  a  rich  line  of  research  in  statistics,  with 
threads  in  machine  learning,  that  has  sought  to  exploit  this  geometry  in  statistical  analysis; 
see  Kass  (1989)  for  a  survey  and  discussion,  or  the  monographs  by  Kass  and  Vos  (1997) 
and Amari and Nagaoka (2000) for more extensive treatments.

We  remark  that  in  spite  of  the  fundamental  nature  of  the  geometric  perspective  in 
statistics,  many  researchers  have  concluded  that  while  it  occasionally  provides  an  interesting 
alternative  interpretation,  it  has  not  contributed  new  results  or  methods  that  cannot  be 
obtained through more conventional analysis. However, in the present work, the kernel
methods we propose can, arguably, be motivated and derived only through the geometry of
statistical manifolds.²

3.1  Geometry  of  Statistical  Families 

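Let $\mathcal{F} = \{ p(\cdot \mid \theta) \}_{\theta \in \Theta}$ be an n-dimensional regular statistical family on a set $\mathcal{X}$. Thus, we
assume that $\Theta \subset \mathbb{R}^n$ is open, and that there is a $\sigma$-finite measure $\mu$ on $\mathcal{X}$, such that for
each $\theta \in \Theta$, $p(\cdot \mid \theta)$ is a density with respect to $\mu$, so that $\int_{\mathcal{X}} p(x \mid \theta)\, d\mu(x) = 1$. We identify
the manifold $M$ with $\Theta$ by assuming that for each $x \in \mathcal{X}$ the mapping $\theta \mapsto p(x \mid \theta)$ is $C^\infty$.
Below, we will discuss cases where $\Theta$ is closed, leading to a manifold $M$ with boundary.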

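Let $\partial_i$ denote $\partial/\partial\theta_i$, and $\ell_\theta(x) = \log p(x \mid \theta)$. The Fisher information metric at $\theta \in \Theta$ is defined in terms of the matrix $g(\theta) \in \mathbb{R}^{n \times n}$ given by

$$g_{ij}(\theta) = E_\theta[\partial_i \ell_\theta\, \partial_j \ell_\theta] = \int_{\mathcal{X}} p(x \mid \theta)\, \partial_i \log p(x \mid \theta)\, \partial_j \log p(x \mid \theta)\, d\mu(x) \tag{33}$$

Since the score $s_i(\theta) = \partial_i \ell_\theta$ has mean zero, $g_{ij}(\theta)$ can be seen as the covariance of $s_i(\theta)$ and $s_j(\theta)$, and $g(\theta)$ is therefore positive-definite. By assumption, it is smoothly varying in $\theta$, and therefore defines a Riemannian metric on $\Theta = M$.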

2  By  a  statistical  manifold  we  mean  simply  a  manifold  of  densities  together  with  the  metric  induced  by 
the  Fisher  information  matrix,  rather  than  the  more  general  notion  of  a  Riemannian  manifold  together  with 
a  (possibly  non-metric)  connection,  as  defined  by  Lauritzen  (1987). 
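An equivalent and sometimes more suggestive form of the Fisher information matrix, as will be seen below for the case of the multinomial, is

$$g_{ij}(\theta) = 4 \int_{\mathcal{X}} \partial_i \sqrt{p(x \mid \theta)}\; \partial_j \sqrt{p(x \mid \theta)}\; d\mu(x) \tag{34}$$

Yet another equivalent form is $g_{ij}(\theta) = -E_\theta[\partial_j \partial_i \ell_\theta]$. To see this, note that

\begin{align}
E_\theta[\partial_j \partial_i \ell_\theta] &= \int_{\mathcal{X}} p(x \mid \theta)\, \partial_j \partial_i \log p(x \mid \theta)\, d\mu(x) \tag{35} \\
&= -\int_{\mathcal{X}} p(x \mid \theta)\, \frac{\partial_j p(x \mid \theta)}{p(x \mid \theta)^2}\, \partial_i p(x \mid \theta)\, d\mu(x) + \int_{\mathcal{X}} \partial_j \partial_i p(x \mid \theta)\, d\mu(x) \tag{36} \\
&= -\int_{\mathcal{X}} p(x \mid \theta)\, \frac{\partial_j p(x \mid \theta)}{p(x \mid \theta)}\, \frac{\partial_i p(x \mid \theta)}{p(x \mid \theta)}\, d\mu(x) + \partial_j \partial_i \int_{\mathcal{X}} p(x \mid \theta)\, d\mu(x) \tag{37} \\
&= -\int_{\mathcal{X}} p(x \mid \theta)\, \partial_j \log p(x \mid \theta)\, \partial_i \log p(x \mid \theta)\, d\mu(x) \tag{38} \\
&= -g_{ij}(\theta) \tag{39}
\end{align}

where the second term in (37) vanishes because $\int_{\mathcal{X}} p(x \mid \theta)\, d\mu(x) = 1$ for every $\theta$.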

Since there are many possible choices of metric on a given differentiable manifold, it is important to consider the motivating properties of the Fisher information metric. Intuitively, the Fisher information may be thought of as the amount of information a single data point supplies with respect to the problem of estimating the parameter $\theta$. This interpretation can be justified in several ways, notably through the efficiency of estimators. In particular, the asymptotic variance of the maximum likelihood estimator $\hat\theta$ obtained using a sample of size $n$ is $(n g(\theta))^{-1}$. Since the MLE is asymptotically unbiased, the inverse Fisher information represents the asymptotic fluctuations of the MLE around the true value. Moreover, by the Cramér-Rao lower bound, the variance of any unbiased estimator is bounded from below by $(n g(\theta))^{-1}$. Additional motivation for the Fisher information metric is provided by the results of Čencov (1982), which characterize it as the only metric (up to multiplication by a constant) that is invariant with respect to certain probabilistically meaningful transformations called congruent embeddings.

The connection with another familiar similarity measure is worth noting here. If $p$ and $q$ are two densities on $\mathcal{X}$ with respect to $\mu$, the Kullback-Leibler divergence $D(p, q)$ is defined by

$$D(p, q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, d\mu(x) \tag{40}$$

The Kullback-Leibler divergence behaves at nearby points like the square of the information distance. More precisely, it can be shown that

$$\lim_{q \to p} \frac{d^2(p, q)}{2 D(p, q)} = 1 \tag{41}$$

where the convergence is uniform as $d(p, q) \to 0$. As we comment below, this relationship may be of use in approximating information diffusion kernels for complex models.
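
The limit in (41) can be checked numerically. The following is a minimal sketch, assuming a trinomial family and using the closed-form multinomial information distance $d(\theta, \theta') = 2 \arccos\bigl(\sum_i \sqrt{\theta_i \theta'_i}\bigr)$ of equation (66) in Section 3.3; the function names and the particular perturbation are illustrative choices, not part of the original development.

```python
import math

def kl_divergence(p, q):
    """D(p, q) = sum_i p_i log(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def information_distance(p, q):
    """Multinomial geodesic distance d(p, q) = 2 arccos(sum_i sqrt(p_i q_i))."""
    inner = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, inner))  # clamp guards against rounding

p = [0.5, 0.3, 0.2]
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    # Hypothetical perturbation that keeps q on the simplex.
    q = [0.5 + eps, 0.3 - eps / 2, 0.2 - eps / 2]
    ratios.append(information_distance(p, q) ** 2 / (2.0 * kl_divergence(p, q)))
    print(eps, ratios[-1])
```

As the perturbation shrinks, the printed ratio $d^2(p, q) / (2 D(p, q))$ approaches 1.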

The following two basic examples illustrate the geometry of the Fisher information metric and the associated diffusion kernel it induces on a statistical manifold. The spherical normal family corresponds to a manifold of constant negative curvature, and the multinomial corresponds to a manifold of constant positive curvature. The multinomial will be the most important example that we develop, and we report extensive experiments with the resulting kernels in Section 5.


3.2  Diffusion  Kernels  for  Gaussian  Geometry 


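Consider the statistical family given by $\mathcal{F} = \{p(\cdot \mid \theta)\}_{\theta \in \Theta}$ where $\theta = (\mu, \sigma)$ and $p(\cdot \mid (\mu, \sigma)) = \mathcal{N}(\mu, \sigma^2 I_{n-1})$, the Gaussian having mean $\mu \in \mathbb{R}^{n-1}$ and covariance $\sigma^2 I_{n-1}$, with $\sigma > 0$. Thus, $\Theta = \mathbb{R}^{n-1} \times \mathbb{R}_+$.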

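To compute the Fisher information metric for this family, it is convenient to use the general expression given by equation (39). Let $\partial_i = \partial/\partial\mu_i$ for $i = 1, \ldots, n-1$, and $\partial_n = \partial/\partial\sigma$. Then simple calculations yield, for $1 \le i, j \le n-1$,

\begin{align}
g_{ij}(\theta) &= -\int_{\mathbb{R}^{n-1}} \partial_i \partial_j \left( -\sum_{k=1}^{n-1} \frac{(x_k - \mu_k)^2}{2\sigma^2} \right) p(x \mid \theta)\, dx \tag{42} \\
&= \frac{1}{\sigma^2}\, \delta_{ij} \tag{43} \\
g_{ni}(\theta) &= -\int_{\mathbb{R}^{n-1}} \partial_n \partial_i \left( -\sum_{k=1}^{n-1} \frac{(x_k - \mu_k)^2}{2\sigma^2} \right) p(x \mid \theta)\, dx \tag{44} \\
&= \frac{2}{\sigma^3} \int_{\mathbb{R}^{n-1}} (x_i - \mu_i)\, p(x \mid \theta)\, dx \tag{45} \\
&= 0 \tag{46} \\
g_{nn}(\theta) &= -\int_{\mathbb{R}^{n-1}} \partial_n \partial_n \left( -\sum_{k=1}^{n-1} \frac{(x_k - \mu_k)^2}{2\sigma^2} - (n-1) \log \sigma \right) p(x \mid \theta)\, dx \tag{47} \\
&= \frac{3}{\sigma^4} \int_{\mathbb{R}^{n-1}} \sum_{k=1}^{n-1} (x_k - \mu_k)^2\, p(x \mid \theta)\, dx - \frac{n-1}{\sigma^2} \tag{48} \\
&= \frac{2(n-1)}{\sigma^2} \tag{49}
\end{align}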

Letting $\theta'$ be new coordinates defined by $\theta'_i = \mu_i$ for $1 \le i \le n-1$ and $\theta'_n = \sqrt{2(n-1)}\, \sigma$, we see that the Fisher information matrix is given by

$$g_{ij}(\theta') = \frac{1}{\sigma^2}\, \delta_{ij} \tag{50}$$

Thus, the Fisher information metric gives $\Theta = \mathbb{R}^{n-1} \times \mathbb{R}_+$ the structure of the upper half plane in hyperbolic space. The distance minimizing or geodesic curves in hyperbolic space are straight lines or circles orthogonal to the mean subspace.
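
The entries (43), (46) and (49) can be sanity-checked by simulation, using the equivalent definition (33) of the Fisher information as the second moment of the score. The sketch below is illustrative only: it assumes $n - 1 = 2$ mean coordinates, a standard deviation parameter $\sigma$, and arbitrary parameter values and sample size.

```python
import random

def score(x, mu, sigma):
    """Score (d/dmu_1, ..., d/dmu_k, d/dsigma) of log N(x | mu, sigma^2 I)."""
    k = len(mu)
    diffs = [xi - mi for xi, mi in zip(x, mu)]
    s_mu = [d / sigma ** 2 for d in diffs]
    s_sigma = sum(d * d for d in diffs) / sigma ** 3 - k / sigma
    return s_mu + [s_sigma]

def estimate_fisher(mu, sigma, num_samples=50_000, seed=0):
    """Monte Carlo estimate of g_ij = E[s_i s_j] under p(. | mu, sigma)."""
    rng = random.Random(seed)
    dim = len(mu) + 1
    g = [[0.0] * dim for _ in range(dim)]
    for _ in range(num_samples):
        x = [rng.gauss(m, sigma) for m in mu]
        s = score(x, mu, sigma)
        for i in range(dim):
            for j in range(dim):
                g[i][j] += s[i] * s[j] / num_samples
    return g

mu, sigma = [0.5, -1.0], 2.0   # illustrative values; here n - 1 = 2
g_hat = estimate_fisher(mu, sigma)
for row in g_hat:
    print([round(v, 3) for v in row])
```

Up to sampling noise, the estimate should be close to the diagonal matrix with entries $1/\sigma^2$, $1/\sigma^2$ and $2(n-1)/\sigma^2$, which for these values is $(0.25, 0.25, 1)$.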

In particular, the univariate normal density has hyperbolic geometry. As a generalization in this 2-dimensional case, any location-scale family of densities is seen to have hyperbolic geometry (Kass and Vos, 1997). Such families have densities of the form

$$p(x \mid (\mu, \sigma)) = \frac{1}{\sigma}\, f\!\left(\frac{x - \mu}{\sigma}\right) \tag{51}$$

where $(\mu, \sigma) \in \mathbb{R} \times \mathbb{R}_+$ and $f : \mathbb{R} \to \mathbb{R}$.

Figure 1: Example decision boundaries for a kernel-based classifier using information diffusion kernels for spherical normal geometry with $d = 2$ (right), which has constant negative curvature, compared with the standard Gaussian kernel for flat Euclidean space (left). Two data points are used, simply to contrast the underlying geometries. The curved decision boundary for the diffusion kernel can be interpreted statistically by noting that as the variance decreases the mean is known with increasing certainty.

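The heat kernel on the hyperbolic space $\mathbb{H}^n$ has the following explicit form (Grigor’yan and Noguchi, 1998). For odd $n = 2m + 1$ it is given by

$$K_t(x, x') = \frac{(-1)^m}{2^m \pi^m}\, \frac{1}{\sqrt{4\pi t}} \left( \frac{1}{\sinh r} \frac{\partial}{\partial r} \right)^{m} \exp\!\left( -m^2 t - \frac{r^2}{4t} \right) \tag{52}$$

and for even $n = 2m + 2$ it is given by

$$K_t(x, x') = \frac{(-1)^m}{2^m \pi^m}\, \frac{\sqrt{2}}{\sqrt{4\pi t}^{\,3}} \left( \frac{1}{\sinh r} \frac{\partial}{\partial r} \right)^{m} \int_r^\infty \frac{s \exp\!\left( -\frac{(2m+1)^2 t}{4} - \frac{s^2}{4t} \right)}{\sqrt{\cosh s - \cosh r}}\, ds \tag{53}$$

where $r = d(x, x')$ is the geodesic distance between the two points in $\mathbb{H}^n$. If only the mean $\theta = \mu$ is unspecified, then the associated kernel is the standard Gaussian RBF kernel.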

A possible use for this kernel in statistical learning is where data points are naturally represented as sets. That is, suppose that each data point is of the form $x = \{x_1, x_2, \ldots, x_m\}$ where $x_i \in \mathbb{R}^{n-1}$. Then the data can be represented according to the mapping which sends each group of points to the corresponding Gaussian under the MLE: $x \mapsto (\hat\mu(x), \hat\sigma(x))$ where

$$\hat\mu(x) = \frac{1}{m} \sum_i x_i \quad \text{and} \quad \hat\sigma(x)^2 = \frac{1}{m} \sum_i \bigl(x_i - \hat\mu(x)\bigr)^2.$$
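
As a small illustration of this mapping, the sketch below sends a set of scalar observations (the case $n - 1 = 1$, assumed here for simplicity) to the point $(\hat\mu, \hat\sigma)$ of the upper half plane. The function name is hypothetical.

```python
import math

def gaussian_mle(points):
    """Map a set of scalar observations to the MLE (mu_hat, sigma_hat)."""
    m = len(points)
    mu_hat = sum(points) / m
    # The MLE of the variance divides by m, not by m - 1.
    var_hat = sum((x - mu_hat) ** 2 for x in points) / m
    return mu_hat, math.sqrt(var_hat)

mu_hat, sigma_hat = gaussian_mle([1.0, 2.0, 3.0, 4.0])
print(mu_hat, sigma_hat)
```

Each set of observations thus becomes a single point in the parameter space $\Theta$, on which the hyperbolic diffusion kernel can then be evaluated.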

In Figure 1 the diffusion kernel for hyperbolic space $\mathbb{H}^2$ is compared with the Euclidean space Gaussian kernel. The curved decision boundary for the diffusion kernel makes intuitive sense, since as the variance decreases the mean is known with increasing certainty.

Note that we can, in fact, consider $M$ as a manifold with boundary by allowing $\sigma \ge 0$ to be non-negative rather than strictly positive, $\sigma > 0$. In this case, the densities on the boundary become singular, as point masses at the mean; the boundary is simply given by $\partial M \cong \mathbb{R}^{n-1}$, which is a manifold without boundary, as required.
3.3 Diffusion Kernels for Multinomial Geometry

We now consider the statistical family of the multinomial over $n + 1$ outcomes, given by $\mathcal{F} = \{p(\cdot \mid \theta)\}_{\theta \in \Theta}$ where $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ with $\theta_i \in (0, 1)$ and $\sum_{i=1}^{n} \theta_i < 1$. The parameter space $\Theta$ is the open $n$-simplex $\mathcal{P}_n$ defined in equation (1), a submanifold of $\mathbb{R}^{n+1}$.

To compute the metric, let $x = (x_1, x_2, \ldots, x_{n+1})$ denote one draw from the multinomial, so that $x_i \in \{0, 1\}$ and $\sum_i x_i = 1$. The log-likelihood and its derivatives are then given by


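\begin{align}
\log p(x \mid \theta) &= \sum_{i=1}^{n+1} x_i \log \theta_i \tag{54} \\
\frac{\partial \log p(x \mid \theta)}{\partial \theta_i} &= \frac{x_i}{\theta_i} \tag{55} \\
\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_i\, \partial \theta_j} &= -\frac{x_i}{\theta_i^2}\, \delta_{ij} \tag{56}
\end{align}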


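Since $\mathcal{P}_n$ is an $n$-dimensional submanifold of $\mathbb{R}^{n+1}$, we can express $u, v \in T_\theta M$ as $(n+1)$-dimensional vectors in $T_\theta \mathbb{R}^{n+1} \cong \mathbb{R}^{n+1}$; thus, $u = \sum_{i=1}^{n+1} u_i e_i$, $v = \sum_{i=1}^{n+1} v_i e_i$. Note that due to the constraint $\sum_{i=1}^{n+1} \theta_i = 1$, the sum of the $n + 1$ components of a tangent vector must be zero. A basis for $T_\theta M$ is

$$\bigl\{ e_1 = (1, 0, \ldots, 0, -1)^\top,\ e_2 = (0, 1, 0, \ldots, 0, -1)^\top,\ \ldots,\ e_n = (0, 0, \ldots, 0, 1, -1)^\top \bigr\} \tag{57}$$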


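Using the definition of the Fisher information metric in equation (35) we then compute

\begin{align}
\langle u, v \rangle_\theta &= -\sum_{i=1}^{n+1} \sum_{j=1}^{n+1} u_i v_j\, E_\theta\!\left[ \frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_i\, \partial \theta_j} \right] \tag{58} \\
&= -\sum_{i=1}^{n+1} u_i v_i\, E_\theta\!\left[ -\frac{x_i}{\theta_i^2} \right] \tag{59} \\
&= \sum_{i=1}^{n+1} \frac{u_i v_i}{\theta_i} \tag{60}
\end{align}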


While geodesic distances are difficult to compute in general, in the case of the multinomial information geometry we can easily compute the geodesics by observing that the standard Euclidean metric on the surface of the positive $n$-sphere is the pull-back of the Fisher information metric on the simplex. This relationship is suggested by the form of the Fisher information given in equation (34).

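To be concrete, the transformation $F(\theta_1, \ldots, \theta_{n+1}) = (2\sqrt{\theta_1}, \ldots, 2\sqrt{\theta_{n+1}})$ is a diffeomorphism of the $n$-simplex $\mathcal{P}_n$ onto the positive portion of the $n$-sphere of radius 2; denote this portion of the sphere as $S_n^+ = \bigl\{ \theta \in \mathbb{R}^{n+1} : \sum_{i=1}^{n+1} \theta_i^2 = 4,\ \theta_i > 0 \bigr\}$. Given tangent vectors $u = \sum_{i=1}^{n+1} u_i e_i$, $v = \sum_{i=1}^{n+1} v_i e_i$, the pull-back of the Fisher information metric through $F^{-1}$ is

\begin{align}
h_\theta(u, v) &= g_{\theta^2/4}\!\left( F^{-1}_* \sum_{k=1}^{n+1} u_k e_k,\ F^{-1}_* \sum_{l=1}^{n+1} v_l e_l \right) \tag{61} \\
&= \sum_{k=1}^{n+1} \sum_{l=1}^{n+1} u_k v_l\, g_{\theta^2/4}\bigl( F^{-1}_* e_k,\ F^{-1}_* e_l \bigr) \tag{62} \\
&= \sum_{k=1}^{n+1} \sum_{l=1}^{n+1} u_k v_l \sum_i \frac{4}{\theta_i^2}\, \bigl( F^{-1}_* e_k \bigr)_i \bigl( F^{-1}_* e_l \bigr)_i \tag{63} \\
&= \sum_{k=1}^{n+1} \sum_{l=1}^{n+1} u_k v_l \sum_i \frac{4}{\theta_i^2}\, \frac{\theta_k \delta_{ki}}{2}\, \frac{\theta_l \delta_{li}}{2} \tag{64} \\
&= \sum_{i=1}^{n+1} u_i v_i \tag{65}
\end{align}

Figure 2: Equal distance contours on $\mathcal{P}_2$ from the upper right edge (left column), the center (center column), and lower right corner (right column). The distances are computed using the Fisher information metric $g$ (top row) or the Euclidean metric (bottom row).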
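
The isometry can also be verified numerically in the forward direction. The sketch below, with illustrative values for $\theta$, $u$ and $v$, pushes tangent vectors of the simplex forward through $F(\theta) = 2\sqrt{\theta}$, whose differential acts coordinate-wise as $u_i \mapsto u_i / \sqrt{\theta_i}$, and compares the Euclidean inner product of the images with the Fisher inner product (60).

```python
import math

def fisher_inner(theta, u, v):
    """Fisher inner product <u, v>_theta = sum_i u_i v_i / theta_i, as in (60)."""
    return sum(ui * vi / ti for ui, vi, ti in zip(u, v, theta))

def pushforward(theta, u):
    """Differential of F(theta) = 2 sqrt(theta) applied to a tangent vector u."""
    return [ui / math.sqrt(ti) for ui, ti in zip(u, theta)]

theta = [0.2, 0.3, 0.5]
u = [0.10, -0.04, -0.06]   # tangent vectors: components sum to zero
v = [-0.02, 0.05, -0.03]

lhs = fisher_inner(theta, u, v)
rhs = sum(a * b for a, b in zip(pushforward(theta, u), pushforward(theta, v)))
print(lhs, rhs)

# The image of u is tangent to the sphere: it is orthogonal to F(theta).
radial = sum(2.0 * math.sqrt(t) * w for t, w in zip(theta, pushforward(theta, u)))
```

The two printed numbers agree up to floating-point rounding, and `radial` is zero up to rounding.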


Since the transformation $F : (\mathcal{P}_n, g) \to (S_n^+, h)$ is an isometry, the geodesic distance $d(\theta, \theta')$ on $\mathcal{P}_n$ may be computed as the length of the shortest curve on $S_n^+$ connecting $F(\theta)$ and $F(\theta')$. These shortest curves are portions of great circles (the intersection of a two dimensional plane and $S_n^+$), and their length is given by

$$d(\theta, \theta') = 2 \arccos\!\left( \sum_{i=1}^{n+1} \sqrt{\theta_i\, \theta'_i} \right) \tag{66}$$
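
A direct implementation of (66) is a few lines long. The sketch below is a minimal version; the clamping of the arccos argument is an added numerical safeguard, not part of the formula.

```python
import math

def geodesic_distance(theta, theta_prime):
    """Information distance d = 2 arccos(sum_i sqrt(theta_i * theta'_i)) of (66)."""
    inner = sum(math.sqrt(a * b) for a, b in zip(theta, theta_prime))
    # Rounding can push the sum slightly above 1 for identical inputs.
    return 2.0 * math.acos(max(-1.0, min(1.0, inner)))

d_same = geodesic_distance([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
d_vertices = geodesic_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(d_same, d_vertices)
```

Two distinct vertices of the closed simplex map to orthogonal points on the sphere of radius 2, so their distance is a quarter of a great circle, $\pi$; this is the largest value the distance can take.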


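In Section 3.1 we noted the connection between the Kullback-Leibler divergence and the information distance. In the case of the multinomial family, there is also a close relationship with the Hellinger distance. In particular, it can be easily shown that the Hellinger distance

$$d_H(\theta, \theta') = \sqrt{ \sum_i \left( \sqrt{\theta_i} - \sqrt{\theta'_i} \right)^2 } \tag{67}$$

is related to $d(\theta, \theta')$ by

$$d_H(\theta, \theta') = 2 \sin\bigl( d(\theta, \theta') / 4 \bigr) \tag{68}$$

Thus, as $\theta' \to \theta$, $d_H$ agrees with $\tfrac{1}{2} d$ to second order:

$$d_H(\theta, \theta') = \tfrac{1}{2}\, d(\theta, \theta') + O\bigl( d^3(\theta, \theta') \bigr) \tag{69}$$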
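
The relation (68) follows from $d_H^2 = 2 - 2\sum_i \sqrt{\theta_i \theta'_i} = 2 - 2\cos(d/2) = 4\sin^2(d/4)$, taking the Hellinger distance to be the Euclidean distance between the square-root vectors. A short numerical check, using two arbitrary points of the 2-simplex, is sketched below.

```python
import math

def geodesic_distance(theta, theta_prime):
    """Information distance of equation (66)."""
    inner = sum(math.sqrt(a * b) for a, b in zip(theta, theta_prime))
    return 2.0 * math.acos(max(-1.0, min(1.0, inner)))

def hellinger_distance(theta, theta_prime):
    """Euclidean distance between the square-root vectors."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(theta, theta_prime)))

theta, theta_prime = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]
d = geodesic_distance(theta, theta_prime)
d_h = hellinger_distance(theta, theta_prime)
print(d_h, 2.0 * math.sin(d / 4.0))
```

The two printed values coincide up to rounding, and for nearby points both are close to $d/2$.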

The Fisher information metric places greater emphasis on points near the boundary, which is expected to be important for text problems, which typically have sparse statistics. Figure 2 shows equal distance contours on $\mathcal{P}_2$ using the Fisher information and the Euclidean metrics.

While the spherical geometry has been derived for a finite multinomial, the same geometry can be used non-parametrically for an arbitrary subset of probability measures, leading to spherical geometry in a Hilbert space (Dawid, 1977).


3.3.1  The  Multinomial  Diffusion  Kernel 

Unlike the explicit expression for the Gaussian geometry discussed above, there is not an explicit form for the heat kernel on the sphere, nor on the positive orthant of the sphere. We will therefore resort to the parametrix expansion to derive an approximate heat kernel for the multinomial.

Recall from Section 2.3.1 that the parametrix is obtained according to the local expansion given in equation (25), and then extending this smoothly to zero outside a neighborhood of the diagonal, as defined by the exponential map. As we have just derived, this results in the following parametrix for the multinomial family:

$$P_t^{(m)}(\theta, \theta') = (4\pi t)^{-\frac{n}{2}} \exp\!\left( -\frac{\arccos^2\bigl( \sqrt{\theta} \cdot \sqrt{\theta'} \bigr)}{t} \right) \bigl( \psi_0(\theta, \theta') + \cdots + \psi_m(\theta, \theta')\, t^m \bigr) \tag{70}$$

The first-order expansion is thus obtained as

$$K_t^{(0)}(\theta, \theta') = \eta\bigl( d(\theta, \theta') \bigr)\, P_t^{(0)}(\theta, \theta') \tag{71}$$
Now, for the $n$-sphere it can be shown that the function $\psi_0$ of (29), which is the leading order correction of the Gaussian kernel under the Fisher information metric, is given by

\begin{align}
\psi_0(r) &= \left( \frac{\sqrt{\det g}}{r^{n-1}} \right)^{-\frac{1}{2}} \tag{72} \\
&= \left( \frac{\sin r}{r} \right)^{-\frac{(n-1)}{2}} \tag{73} \\
&= 1 + \frac{(n-1)}{12}\, r^2 + \frac{(n-1)(5n-1)}{1440}\, r^4 + O(r^6) \tag{74}
\end{align}

(Berger et al., 1971). Thus, the leading order parametrix for the multinomial diffusion kernel is

$$P_t^{(0)}(\theta, \theta') = (4\pi t)^{-\frac{n}{2}} \exp\!\left( -\frac{1}{4t}\, d^2(\theta, \theta') \right) \left( \frac{\sin d(\theta, \theta')}{d(\theta, \theta')} \right)^{-\frac{(n-1)}{2}} \tag{75}$$

In our experiments we approximate this kernel further as

$$P_t^{(0)}(\theta, \theta') = (4\pi t)^{-\frac{n}{2}} \exp\!\left( -\frac{1}{t} \arccos^2\bigl( \sqrt{\theta} \cdot \sqrt{\theta'} \bigr) \right) \tag{76}$$

by appealing to the asymptotic expansion in (74); note that $(\sin r / r)^{-n}$ blows up for large $r$. In Figure 3 the kernel (76) is compared with the standard Euclidean space Gaussian kernel for the case of the trinomial model, $d = 2$, using an SVM classifier.

Figure 3: Example decision boundaries using support vector machines with information diffusion kernels for trinomial geometry on the 2-simplex (top right) compared with the standard Gaussian kernel (left).
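
The approximation (76) is simple to evaluate in practice. The following sketch computes it for points on the simplex and assembles a small Gram matrix; the function name, the sample points and the value of $t$ are illustrative assumptions, and the clamp on the arccos argument is an added numerical safeguard.

```python
import math

def multinomial_diffusion_kernel(theta, theta_prime, t):
    """Approximate kernel (76): (4 pi t)^(-n/2) exp(-arccos^2(sqrt(th).sqrt(th')) / t)."""
    n = len(theta) - 1  # dimension of the simplex for n + 1 outcomes
    inner = sum(math.sqrt(a * b) for a, b in zip(theta, theta_prime))
    angle = math.acos(max(-1.0, min(1.0, inner)))
    return (4.0 * math.pi * t) ** (-n / 2.0) * math.exp(-angle ** 2 / t)

# Three illustrative trinomial parameter vectors (n = 2).
points = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
t = 0.5
gram = [[multinomial_diffusion_kernel(p, q, t) for q in points] for p in points]
for row in gram:
    print([round(k, 4) for k in row])
```

The Gram matrix is symmetric, its diagonal equals $(4\pi t)^{-n/2}$, and nearby points on the simplex receive larger kernel values than distant ones. Such a matrix could be passed to any kernel method that accepts precomputed kernels.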


3.3.2  Rounding  the  Simplex 

The case of multinomial geometry poses some technical complications for the analysis of diffusion kernels, due to the fact that the open simplex is not complete, and moreover, its closure is not a differentiable manifold with boundary. Thus, it is technically not possible to apply several results from differential geometry, such as bounds on the spectrum of the Laplacian, as adopted in Section 4. We now briefly describe a technical “patch” that allows us to derive all of the needed analytical results, without sacrificing in practice any of the methodology that has been derived so far.

Figure 4: Rounding the simplex. Since the closed simplex is not a manifold with boundary, we carry out a “rounding” procedure to remove edges and corners. The $\epsilon$-rounded simplex is the closure of the union of all $\epsilon$-balls lying within the open simplex.

Let $\Delta_n = \overline{\mathcal{P}_n}$ denote the closure of the open simplex; thus $\Delta_n$ is the usual probability simplex which allows zero probability for some items. However, it does not form a compact manifold with boundary since the boundary has edges and corners. In other words, local charts $\varphi : U \to \mathbb{R}^{n+}$ cannot be defined to be differentiable. To adjust for this, the idea is to “round the edges” of $\Delta_n$ to obtain a subset that forms a compact manifold with boundary, and that closely approximates the original simplex.

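For $\epsilon > 0$, let $B_\epsilon(x) = \{ y : \|x - y\| < \epsilon \}$ denote the open Euclidean ball of radius $\epsilon$ centered at $x$. Denote by $C_\epsilon(\mathcal{P}_n)$ the $\epsilon$-ball centers of $\mathcal{P}_n$, the points of the simplex whose $\epsilon$-balls lie completely within the simplex:

$$C_\epsilon(\mathcal{P}_n) = \{ x \in \mathcal{P}_n : B_\epsilon(x) \subset \mathcal{P}_n \} \tag{77}$$

Finally, let $\mathcal{P}_n^\epsilon$ denote the $\epsilon$-interior of $\mathcal{P}_n$, which we define as the union of all $\epsilon$-balls contained in $\mathcal{P}_n$:

$$\mathcal{P}_n^\epsilon = \bigcup_{x \in C_\epsilon(\mathcal{P}_n)} B_\epsilon(x) \tag{78}$$

The $\epsilon$-rounded simplex $\Delta_n^\epsilon$ is then defined as the closure $\Delta_n^\epsilon = \overline{\mathcal{P}_n^\epsilon}$.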

The rounding procedure that yields $\Delta_2^\epsilon$ is suggested by Figure 4. Note that in general the $\epsilon$-rounded simplex $\Delta_n^\epsilon$ will contain points with a single, but not more than one, component having zero probability. The set $\Delta_n^\epsilon$ forms a compact manifold with boundary, and its image under the isometry $F : (\mathcal{P}_n, g) \to (S_n^+, h)$ is a compact submanifold with boundary of the $n$-sphere.

Whenever  appealing  to  results  for  compact  manifolds  with  boundary  in  the  following, 
it  will  be  tacitly  assumed  that  the  above  rounding  procedure  has  been  carried  out  in  the 
case  of  the  multinomial. 
4 Spectral Bounds on Covering Numbers and Rademacher Averages

We now turn to establishing bounds on the generalization performance of kernel machines that use information diffusion kernels. We begin by adopting the approach of Guo et al. (2002), estimating covering numbers by making use of bounds on the spectrum of the Laplacian on a Riemannian manifold, rather than on VC dimension techniques; these bounds in turn yield bounds on the expected risk of the learning algorithms. Our calculations give an indication of how the underlying geometry influences the entropy numbers, which are inverse to the covering numbers. We then show how bounds on Rademacher averages may be obtained by plugging in the spectral bounds from differential geometry. The primary conclusion that is drawn from these analyses is that from the point of view of generalization error bounds, diffusion kernels behave essentially the same as the standard Gaussian kernel.

4.1  Covering  Numbers 

We begin by recalling the main result of Guo et al. (2002), modifying their notation slightly to conform with ours. Let $M \subset \mathbb{R}^d$ be a compact subset of $d$-dimensional Euclidean space, and suppose that $K : M \times M \to \mathbb{R}$ is a Mercer kernel. Denote by $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ the eigenvalues of $K$, i.e., of the mapping $f \mapsto \int_M K(\cdot, y) f(y)\, dy$, and let $\psi_j(\cdot)$ denote the corresponding eigenfunctions. We assume that $C_K := \sup_j \|\psi_j\|_\infty < \infty$.

Given $m$ points $x_i \in M$, the kernel hypothesis class for $x = \{x_i\}$ with weight vector bounded by $R$ is defined as the collection of functions on $x$ given by

$$\mathcal{F}_R(x) = \{ f : f(x_i) = \langle w, \Phi(x_i) \rangle \text{ for some } \|w\| \le R \} \tag{79}$$

where $\Phi(\cdot)$ is the mapping from $M$ to feature space defined by the Mercer kernel, and $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ denote the corresponding Hilbert space inner product and norm. It is of interest to obtain uniform bounds on the covering numbers $\mathcal{N}(\epsilon, \mathcal{F}_R(x))$, defined as the size of the smallest $\epsilon$-cover of $\mathcal{F}_R(x)$ in the metric induced by the norm $\|f\|_{\infty, x} = \max_{i=1,\ldots,m} |f(x_i)|$. The following is the main result of Guo et al. (2002).

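Theorem 2 Given an integer $n \in \mathbb{N}$, let $j_n^*$ denote the smallest integer $j$ for which

$$\lambda_{j+1} < \left( \frac{\lambda_1 \cdots \lambda_j}{n^2} \right)^{\frac{1}{j}} \tag{80}$$

and define

$$\epsilon_n^* = 6\, C_K R \sqrt{ j_n^* \left( \frac{\lambda_1 \cdots \lambda_{j_n^*}}{n^2} \right)^{\frac{1}{j_n^*}} + \sum_{i=j_n^*}^{\infty} \lambda_i } \tag{81}$$

Then $\sup_{\{x_i\} \in M^m} \mathcal{N}(\epsilon_n^*, \mathcal{F}_R(x)) \le n$.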
To apply this result, we will obtain bounds on the indices $j_n^*$ using spectral theory in Riemannian geometry. The following bounds on the eigenvalues of the Laplacian are due to Li and Yau (1980).

Theorem 3 Let $M$ be a compact Riemannian manifold of dimension $d$ with non-negative Ricci curvature, and let $0 < \mu_1 \le \mu_2 \le \cdots$ denote the eigenvalues of the Laplacian with Dirichlet boundary conditions. Then

$$c_1(d) \left( \frac{j}{V} \right)^{\frac{2}{d}} \le \mu_j \le c_2(d) \left( \frac{j+1}{V} \right)^{\frac{2}{d}} \tag{82}$$

where $V$ is the volume of $M$ and $c_1$ and $c_2$ are constants depending only on the dimension.

Note that the manifold of the multinomial model satisfies the conditions of this theorem. Using these results we can establish the following bounds on covering numbers for information diffusion kernels. We assume Dirichlet boundary conditions; a similar result can be proven for Neumann boundary conditions. We include the constant $V = \operatorname{vol}(M)$ and diffusion coefficient $t$ in order to indicate how the bounds depend on the geometry.


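Theorem 4 Let $M$ be a compact Riemannian manifold, with volume $V$, satisfying the conditions of Theorem 3. Then the covering numbers for the Dirichlet heat kernel $K_t$ on $M$ satisfy

$$\log \mathcal{N}(\epsilon, \mathcal{F}_R(x)) = O\!\left( \frac{V}{t^{\frac{d}{2}}} \log^{\frac{d+2}{2}}\!\left( \frac{1}{\epsilon} \right) \right) \tag{83}$$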

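Proof By the lower bound in Theorem 3, the Dirichlet eigenvalues of the heat kernel $K_t(x, y)$, which are given by $\lambda_j = e^{-t \mu_j}$, satisfy $\log \lambda_j \le -t c_1(d) \left( \frac{j}{V} \right)^{\frac{2}{d}}$. Thus,

$$-\frac{1}{j} \log\!\left( \frac{\lambda_1 \cdots \lambda_j}{n^2} \right) \ge \frac{t c_1}{j} \sum_{i=1}^{j} \left( \frac{i}{V} \right)^{\frac{2}{d}} + \frac{2}{j} \log n \ge t c_1 \frac{d}{d+2} \left( \frac{j}{V} \right)^{\frac{2}{d}} + \frac{2}{j} \log n \tag{84}$$

where the second inequality comes from $\sum_{i=1}^{j} i^p \ge \int_0^j x^p\, dx = \frac{j^{p+1}}{p+1}$. Now using the upper bound of Theorem 3, the inequality $j_n^* \le j$ will hold if

$$t c_2 \left( \frac{j+2}{V} \right)^{\frac{2}{d}} \ge -\log \lambda_{j+1} \ge t c_1 \frac{d}{d+2} \left( \frac{j}{V} \right)^{\frac{2}{d}} + \frac{2}{j} \log n \tag{85}$$

or equivalently

$$\frac{t c_2}{V^{\frac{2}{d}}} \left( j (j+2)^{\frac{2}{d}} - \frac{c_1}{c_2} \frac{d}{d+2}\, j^{\frac{d+2}{d}} \right) \ge 2 \log n \tag{86}$$

The above inequality will hold in case

$$j \ge \left[ \frac{2 V^{\frac{2}{d}}}{t \left( c_2 - c_1 \frac{d}{d+2} \right)} \log n \right]^{\frac{d}{d+2}} \ge \left[ \frac{V^{\frac{2}{d}} (d+2)}{t c_1} \log n \right]^{\frac{d}{d+2}} \tag{87}$$

since we may assume that $c_2 \ge c_1$; thus, $j_n^* \le \left\lceil c_1 \left( \frac{V^{\frac{2}{d}}}{t} \log n \right)^{\frac{d}{d+2}} \right\rceil$ for a new constant $c_1(d)$.

Plugging this bound on $j_n^*$ into the expression for $\epsilon_n^*$ in Theorem 2 and using

$$\sum_{i=j_n^*}^{\infty} e^{-i^{\frac{2}{d}}} = O\!\left( e^{-{j_n^*}^{\frac{2}{d}}} \right) \tag{88}$$

we have after some algebra that

$$\log\!\left( \frac{1}{\epsilon_n^*} \right) = \Omega\!\left( \left( \frac{t}{V^{\frac{2}{d}}} \right)^{\frac{d}{d+2}} \log^{\frac{2}{d+2}} n \right) \tag{89}$$

Inverting the above expression in $\log n$ gives equation (83). ■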

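We note that Theorem 4 of Guo et al. (2002) can be used to show that this bound does not, in fact, depend on $m$ and $x$. Thus, for fixed $t$ the covering numbers scale as $\log \mathcal{N}(\epsilon, \mathcal{F}) = O\bigl( \log^{\frac{d+2}{2}}\bigl( \frac{1}{\epsilon} \bigr) \bigr)$, and for fixed $\epsilon$ they scale as $\log \mathcal{N}(\epsilon, \mathcal{F}) = O\bigl( t^{-\frac{d}{2}} \bigr)$ in the diffusion time $t$.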


4.2  Rademacher  Averages 

We  now  describe  a  different  family  of  generalization  error  bounds  that  can  be  derived  using 
the  machinery  of  Rademacher  averages  (Bartlett  and  Mendelson,  2002,  Bartlett  et  al.,  2003). 
The  bounds  fall  out  directly  from  the  work  of  Mendelson  (2003)  on  computing  local  averages 
for  kernel-based  function  classes,  after  plugging  in  the  eigenvalue  bounds  of  Theorem  3. 

As seen above, covering number bounds are related to a complexity term of the form

$$C(n) = \sqrt{ j_n^* \left( \frac{\lambda_1 \cdots \lambda_{j_n^*}}{n^2} \right)^{\frac{1}{j_n^*}} + \sum_{i=j_n^*}^{\infty} \lambda_i } \tag{90}$$

In the case of Rademacher complexities, risk bounds are instead controlled by a similar, yet simpler expression of the form

$$C'(r) = \sqrt{ j_r^*\, r + \sum_{i=j_r^*}^{\infty} \lambda_i } \tag{91}$$

where now $j_r^*$ is the smallest integer $j$ for which $\lambda_j < r$ (Mendelson, 2003), with $r$ acting as a parameter bounding the error of the family of functions. To place this into some context, we quote the following results from Bartlett et al. (2003) and Mendelson (2003), which apply to a family of loss functions that includes the quadratic loss; we refer to Bartlett et al. (2003) for details on the technical conditions.

Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be an independent sample from an unknown distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} \subset \mathbb{R}$. For a given loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, and a family $\mathcal{F}$ of measurable functions $f : \mathcal{X} \to \mathcal{Y}$, the objective is to minimize the expected loss $E[\ell(f(X), Y)]$. Let $E \ell_{f^*} = \inf_{f \in \mathcal{F}} E \ell_f$, where $\ell_f(X, Y) = \ell(f(X), Y)$, and let $\hat f$ be any member of $\mathcal{F}$ for which $E_n \ell_{\hat f} = \inf_{f \in \mathcal{F}} E_n \ell_f$, where $E_n$ denotes the empirical expectation. The Rademacher average of a family of functions $\mathcal{G} = \{ g : \mathcal{X} \to \mathbb{R} \}$ is defined as the expectation $E R_n \mathcal{G} = E\bigl[ \sup_{g \in \mathcal{G}} R_n g \bigr]$ with $R_n g = \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, g(X_i)$, where $\sigma_1, \ldots, \sigma_n$ are independent Rademacher random variables; that is, $p(\sigma_i = 1) = p(\sigma_i = -1) = \frac{1}{2}$.


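Theorem 5 Let $\mathcal{F}$ be a convex class of functions and define $\psi$ by

$$\psi(r) = a\, E R_n \bigl\{ f \in \mathcal{F} : E(f - f^*)^2 \le r \bigr\} + \frac{b\, x}{n} \tag{92}$$

where $a$ and $b$ are constants that depend on the loss function $\ell$. Then when $r \ge \psi(r)$,

$$E\bigl( \ell_{\hat f} - \ell_{f^*} \bigr) \le c\, r + \frac{d\, x}{n} \tag{93}$$

with probability at least $1 - e^{-x}$, where $c$ and $d$ are additional constants.

Moreover, suppose that $K$ is a Mercer kernel and $\mathcal{F} = \{ f \in \mathcal{H}_K : \|f\|_K \le 1 \}$ is the unit ball in the reproducing kernel Hilbert space associated with $K$. Then

$$\psi(r) \le a \sqrt{ \frac{2}{n} \sum_{j=1}^{\infty} \min\{ r, \lambda_j \} } + \frac{b\, x}{n} \tag{94}$$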


Thus, to bound the excess risk for kernel machines in this framework it suffices to bound the term

\begin{align}
\tilde\psi(r) &= \sqrt{ \sum_{j=1}^{\infty} \min\{ r, \lambda_j \} } \tag{95} \\
&= \sqrt{ j_r^*\, r + \sum_{i=j_r^*}^{\infty} \lambda_i } \tag{96}
\end{align}

involving the spectrum. Given bounds on the eigenvalues, this is typically easy to do.
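
To make the spectral term concrete, the sketch below evaluates $\sum_j \min\{r, \lambda_j\}$ for a hypothetical heat kernel spectrum $\lambda_j = \exp\bigl(-t c\, (j/V)^{2/d}\bigr)$, the shape suggested by the bounds of Theorem 3. The constants $c = V = 1$, the truncation at 2000 eigenvalues, and the values of $t$, $d$ and $r$ are all assumptions made for illustration. With $j_r^*$ taken as the smallest index with $\lambda_j < r$, exactly $j_r^* - 1$ terms are capped at $r$, so the split computed here differs from the form written in (96) by at most one term $r$.

```python
import math

def heat_kernel_eigenvalues(t, d, num=2000, c=1.0, volume=1.0):
    """Hypothetical spectrum lambda_j = exp(-t c (j / V)^(2/d)), j = 1..num."""
    return [math.exp(-t * c * (j / volume) ** (2.0 / d)) for j in range(1, num + 1)]

def psi_squared(r, eigenvalues):
    """Truncated version of sum_j min{r, lambda_j}."""
    return sum(min(r, lam) for lam in eigenvalues)

def j_star(r, eigenvalues):
    """Smallest 1-based index j with lambda_j < r (eigenvalues are decreasing)."""
    for j, lam in enumerate(eigenvalues, start=1):
        if lam < r:
            return j
    return len(eigenvalues) + 1

lams = heat_kernel_eigenvalues(t=0.5, d=2)
r = 1e-3
js = j_star(r, lams)
direct = psi_squared(r, lams)
split = (js - 1) * r + sum(lams[js - 1:])   # capped head plus spectral tail
print(js, direct, split)
```

For these values the capped head dominates the sum, which is why a bound on $j_r^*$ from the eigenvalue estimates translates directly into a bound on $\tilde\psi(r)$.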


Theorem 6 Let $M$ be a compact Riemannian manifold, satisfying the conditions of Theorem 3. Then the Rademacher term $\tilde\psi$ for the Dirichlet heat kernel $K_t$ on $M$ satisfies

$$\tilde\psi(r) \le \sqrt{ C \left( \frac{r}{t^{\frac{d}{2}}} \right) \log^{\frac{d}{2}}\!\left( \frac{1}{r} \right) } \tag{97}$$

for some constant $C$ depending on the geometry of $M$.
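Proof We have that

\begin{align}
\tilde\psi^2(r) &= \sum_{j=1}^{\infty} \min\{ r, \lambda_j \} \tag{98} \\
&= j_r^*\, r + \sum_{j=j_r^*}^{\infty} e^{-t \mu_j} \tag{99} \\
&\le j_r^*\, r + \sum_{j=j_r^*}^{\infty} e^{-t c_1 j^{\frac{2}{d}}} \tag{100} \\
&\le j_r^*\, r + C e^{-t c_1 {j_r^*}^{\frac{2}{d}}} \tag{101}
\end{align}

for some constant $C$, where the first inequality follows from the lower bound in Theorem 3. But $j_r^* \le j$ in case $\log \lambda_{j+1} > r$, or, again from Theorem 3, if

$$t c_2 (j+1)^{\frac{2}{d}} \le -\log \lambda_j < \log \frac{1}{r} \tag{102}$$

or equivalently,

$$j_r^* \le \frac{C'}{t^{\frac{d}{2}}} \log^{\frac{d}{2}}\!\left( \frac{1}{r} \right) \tag{103}$$

It follows that

$$\tilde\psi^2(r) \le C'' \left( \frac{r}{t^{\frac{d}{2}}} \right) \log^{\frac{d}{2}}\!\left( \frac{1}{r} \right) \tag{104}$$

for some new constant $C''$. ■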

From this bound, it can be shown that, with high probability,

$$E\bigl( \ell_{\hat f} - \ell_{f^*} \bigr) = O\!\left( \frac{\log^{\frac{d}{2}} n}{n} \right) \tag{105}$$

which is the behavior expected of the Gaussian kernel for Euclidean space.

Thus, for both covering numbers and Rademacher averages, the resulting bounds are essentially the same as those that would be obtained for the Gaussian kernel on the flat $d$-dimensional torus, which is the standard way of “compactifying” Euclidean space to get a Laplacian having only discrete spectrum; the results of Guo et al. (2002) are formulated for the case $d = 1$, corresponding to the circle $S^1$. While the bounds for diffusion kernels were derived for the case of non-negative curvature, which covers the special case of the multinomial, similar bounds for general manifolds with curvature bounded below by a negative constant should also be attainable.

5  Multinomial  Diffusion  Kernels  and  Text  Classification 

In this section we present the application of multinomial diffusion kernels to the problem of text classification. Text processing can be subject to some of the “dirty laundry” referred to in the introduction: documents are cast as Euclidean space vectors with special weighting schemes that have been empirically honed through applications in information retrieval, rather than inspired from first principles. However, for text, the use of multinomial geometry is natural and well motivated; our experimental results offer some insight into how useful this geometry may be for classification.


5.1  Representing  Documents 

Assuming a vocabulary $V$ of size $n + 1$, a document may be represented as a sequence of words over the alphabet $V$. For many classification tasks it is not unreasonable to discard word order; indeed, humans can typically easily understand the high level topic of a document by inspecting its contents as a mixed up “bag of words.” Let $x_v$ denote the number of times term $v$ appears in a document. Then $\{x_v\}_{v \in V}$ is the sample space of the multinomial distribution, with a document modeled as independent draws from a fixed model, which may change from document to document. It is natural to embed documents in the multinomial simplex using an embedding function $\hat\theta : \mathbb{Z}_+^{n+1} \to \mathcal{P}_n$. We consider several embeddings $\hat\theta$ that correspond to well known feature representations in text classification (Joachims, 2000). The term frequency (tf) representation uses normalized counts; the corresponding embedding is the maximum likelihood estimator for the multinomial distribution

$$\hat\theta_{\mathrm{tf}}(x) = \left( \frac{x_1}{\sum_i x_i}, \ldots, \frac{x_{n+1}}{\sum_i x_i} \right) \tag{106}$$


Another common representation is based on term frequency, inverse document frequency (tfidf). This representation uses the distribution of terms across documents to discount common terms; the document frequency $\mathrm{df}_v$ of term $v$ is defined as the number of documents in which term $v$ appears. Although many variants have been proposed, one of the simplest and most commonly used embeddings is

$$\hat\theta_{\mathrm{tfidf}}(x) = \left( \frac{x_1 \log(D / \mathrm{df}_1)}{\sum_i x_i \log(D / \mathrm{df}_i)}, \ldots, \frac{x_{n+1} \log(D / \mathrm{df}_{n+1})}{\sum_i x_i \log(D / \mathrm{df}_i)} \right) \tag{107}$$

where $D$ is the number of documents in the corpus.
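
Both embeddings amount to a reweighting of the term counts followed by $L_1$ normalization. The sketch below implements them for a toy vocabulary of four terms; the counts, document frequencies and corpus size are invented for illustration.

```python
import math

def tf_embedding(counts):
    """Term-frequency embedding (106): counts normalized to sum to one."""
    total = sum(counts)
    return [c / total for c in counts]

def tfidf_embedding(counts, doc_freqs, num_docs):
    """tfidf embedding (107): counts weighted by log(D / df_v), then L1-normalized."""
    weights = [c * math.log(num_docs / df) for c, df in zip(counts, doc_freqs)]
    total = sum(weights)
    return [w / total for w in weights]

counts = [3, 0, 1, 4]          # hypothetical term counts for one document
doc_freqs = [50, 10, 100, 5]   # hypothetical document frequencies
num_docs = 100                 # hypothetical corpus size D

theta_tf = tf_embedding(counts)
theta_tfidf = tfidf_embedding(counts, doc_freqs, num_docs)
print(theta_tf)
print([round(v, 4) for v in theta_tfidf])
```

A term that appears in every document receives weight $\log 1 = 0$ under tfidf, and terms absent from the document have zero coordinates under either embedding, so embedded documents typically lie on the boundary of the closed simplex.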

We note that in text classification applications the tf and tfidf representations are typically
normalized to unit length in the $L_2$ norm rather than the $L_1$ norm, as above (Joachims,
2000). For example, the tf representation with $L_2$ normalization is given by

$$\hat\theta_{\mathrm{tf}}(x) = \left( \frac{x_1}{\sqrt{\sum_i x_i^2}}, \ldots, \frac{x_{n+1}}{\sqrt{\sum_i x_i^2}} \right), \tag{108}$$

and similarly for tfidf. When used in support vector machines with linear or Gaussian kernels,
$L_2$-normalized tf and tfidf achieve higher accuracies than their $L_1$-normalized counterparts.
However, for the diffusion kernels, $L_1$ normalization is necessary to obtain an
embedding into the simplex. These different embeddings or feature representations are
compared in the experimental results reported below.
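
The embeddings described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments; the function name `embed`, its parameters, and the toy counts and document frequencies are assumptions introduced here.

```python
import numpy as np

def embed(counts, doc_freq=None, num_docs=None, norm="l1"):
    """Map a term-count vector to a tf or tfidf representation.

    counts   : term counts x_1, ..., x_{n+1} for one document
    doc_freq : document frequencies df_v (None selects plain tf)
    num_docs : corpus size D (required when doc_freq is given)
    norm     : "l1" gives a point on the simplex, "l2" a unit-length vector
    """
    weights = np.asarray(counts, dtype=float)
    if doc_freq is not None:
        # tfidf: discount terms that occur in many documents.
        weights = weights * np.log(num_docs / np.asarray(doc_freq, dtype=float))
    if norm == "l1":
        scale = weights.sum()
    elif norm == "l2":
        scale = np.sqrt((weights ** 2).sum())
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if scale == 0:
        raise ValueError("all weights are zero; the embedding is undefined")
    return weights / scale

counts = np.array([3, 1, 0, 4])      # toy document over a 4-term vocabulary
doc_freq = np.array([8, 2, 5, 4])    # hypothetical document frequencies
theta_tf = embed(counts)                             # L1-normalized tf
theta_tfidf = embed(counts, doc_freq, num_docs=10)   # L1-normalized tfidf
theta_tf_l2 = embed(counts, norm="l2")               # L2-normalized tf
```

Only the $L_1$-normalized vectors sum to one; the $L_2$-normalized vector has unit Euclidean length but its entries sum to more than one, so it does not lie in the simplex.
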
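
To be clear, we list the three kernels we compare. First, the linear kernel is given by

$$K^{\mathrm{Lin}}(\theta, \theta') = \theta \cdot \theta' = \sum_{v=1}^{n+1} \theta_v \, \theta'_v. \tag{109}$$

The Gaussian kernel is given by

$$K^{\mathrm{Gauss}}_\sigma(\theta, \theta') = (2\pi\sigma)^{-\frac{n+1}{2}} \exp\left( -\frac{\|\theta - \theta'\|^2}{2\sigma^2} \right), \tag{110}$$

where $\|\theta - \theta'\|^2 = \sum_{v=1}^{n+1} |\theta_v - \theta'_v|^2$ is the squared Euclidean distance. The multinomial
diffusion kernel is given by

$$K^{\mathrm{Mult}}_t(\theta, \theta') = (4\pi t)^{-\frac{n}{2}} \exp\left( -\frac{1}{t} \arccos^2\left( \sqrt{\theta} \cdot \sqrt{\theta'} \right) \right), \tag{111}$$

as derived in Section 3.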
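
As a rough illustration, the three kernels can be written as follows. This is a sketch under stated assumptions: the function names are introduced here, and the constant prefactors of the Gaussian and diffusion kernels are omitted because they do not depend on the pair of points being compared.

```python
import numpy as np

def linear_kernel(theta, theta_p):
    """Equation (109): Euclidean inner product."""
    return float(np.dot(theta, theta_p))

def gaussian_kernel(theta, theta_p, sigma):
    """Exponential factor of the Gaussian kernel; constant prefactor omitted."""
    sq_dist = float(np.sum((np.asarray(theta) - np.asarray(theta_p)) ** 2))
    return float(np.exp(-sq_dist / (2.0 * sigma ** 2)))

def multinomial_diffusion_kernel(theta, theta_p, t):
    """Exponential factor of the diffusion kernel; constant prefactor omitted."""
    # Inner product of the square-root vectors; clipping guards against
    # round-off pushing the value slightly outside [0, 1].
    inner = np.clip(np.sum(np.sqrt(np.asarray(theta) * np.asarray(theta_p))), 0.0, 1.0)
    return float(np.exp(-np.arccos(inner) ** 2 / t))
```

Both arguments of `multinomial_diffusion_kernel` are expected to be points of the simplex, that is, nonnegative vectors summing to one. For vocabularies with thousands of terms the omitted prefactors can underflow or overflow in floating point, which is one practical reason to leave them out of a sketch like this.
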

5.2  Experimental  Results 

In  our  experiments,  the  multinomial  diffusion  kernel  using  the  tf  embedding  was  compared 
to  the  linear  or  Gaussian  (RBF)  kernel  with  tf  and  tfidf  embeddings  using  a  support  vector 
machine  classifier  on  the  WebKB  and  Reuters-21578  collections,  which  are  standard  data 
sets  for  text  classification. 

Figure  5  shows  the  test  set  error  rate  for  the  WebKB  data,  for  a  representative  instance 
of  the  one-versus-all  classification  task;  the  designated  class  was  course.  The  results  for 
the  other  choices  of  positive  class  were  qualitatively  very  similar;  all  of  the  results  are 
summarized  in  Table  1.  Similarly,  Figure  7  shows  the  test  set  error  rates  for  two  of  the 
one-versus-all  experiments  on  the  Reuters  data,  where  the  designated  classes  were  chosen 
to  be  acq  and  moneyFx.  All  of  the  results  for  Reuters  one-versus-all  tasks  are  shown  in 
Table  3. 

The WebKB dataset contains web pages found on the sites of four universities (Craven
et al., 2000). The pages were classified according to whether they were student, faculty,
course, project or staff pages; these categories contain 1641, 1124, 929, 504 and 137 instances,
respectively. Since only the student, faculty, course and project classes contain
more than 500 documents each, we restricted our attention to these classes. The Reuters-
21578  dataset  is  a  collection  of  newswire  articles  classified  according  to  news  topic  (Lewis 
and  Ringuette,  1994).  Although  there  are  more  than  135  topics,  most  of  the  topics  have 
fewer  than  100  documents;  for  this  reason,  we  restricted  our  attention  to  the  following  five 
most  frequent  classes:  earn,  acq,  moneyFx,  grain  and  crude,  of  sizes  3964,  2369,  717,  582 
and  578  documents,  respectively. 

For  both  the  WebKB  and  Reuters  collections  we  created  two  types  of  binary  classification 
tasks.  In  the  first  task  we  designate  a  specific  class,  label  each  document  in  the  class  as 
a  “positive”  example,  and  label  each  document  on  any  of  the  other  topics  as  a  “negative” 
example.  In  the  second  task  we  designate  a  class  as  the  positive  class,  and  choose  the 

Figure 5: Experimental results on the WebKB corpus, using SVMs for linear (dotted) and
Gaussian (dash-dotted) kernels, compared with the diffusion kernel for the multinomial
(solid). Classification error for the task of labeling course vs. either faculty, project, or
student is shown in these plots, as a function of training set size. The left plot uses tf
representation and the right plot uses tfidf representation. The curves shown are the error rates
averaged over 20-fold cross validation, with error bars representing one standard deviation.
The results for the other “1 vs. all” labeling tasks are qualitatively similar, and are therefore
not shown.


Figure  6:  Results  on  the  WebKB  corpus,  using  SVMs  for  linear  (dotted)  and  Gaussian 
(dash-dotted)  kernels,  compared  with  the  diffusion  kernel  (solid).  The  course  pages  are 
labeled  positive  and  the  student  pages  are  labeled  negative;  results  for  other  label  pairs 
are  qualitatively  similar.  The  left  plot  uses  tf  representation  and  the  right  plot  uses  tfidf 
representation. 

Figure 7: Experimental results on the Reuters corpus, using SVMs for linear (dotted) and
Gaussian (dash-dotted) kernels, compared with the diffusion kernel (solid). The classes acq
(top), and moneyFx (bottom) are shown; the other classes are qualitatively similar. The
left column uses tf representation and the right column uses tfidf. The curves shown are the
error rates averaged over 20-fold cross validation, with error bars representing one standard
deviation.

Figure 8: Experimental results on the Reuters corpus, using SVMs for linear (dotted) and
Gaussian (dash-dotted) kernels, compared with the diffusion kernel (solid). The classes moneyFx
(top) and grain (bottom) are labeled as positive, and the class earn is labeled negative. The
left column uses tf representation and the right column uses tfidf representation.

                              tf Representation               tfidf Representation
Task                   L      Linear    Gaussian  Diffusion   Linear    Gaussian  Diffusion
course vs. all         40     0.1225    0.1196    0.0646      0.0761    0.0726    0.0514
                       80     0.0809    0.0805    0.0469      0.0569    0.0564    0.0357
                       120    0.0675    0.0670    0.0383      0.0473    0.0469    0.0291
                       200    0.0539    0.0532    0.0315      0.0385    0.0380    0.0238
                       400    0.0412    0.0406    0.0241      0.0304    0.0300    0.0182
                       600    0.0362    0.0355    0.0213      0.0267    0.0265    0.0162

faculty vs. all        40     0.2336    0.2303    0.1859      0.2493    0.2469    0.1947
                       80     0.1947    0.1928    0.1558      0.2048    0.2043    0.1562
                       120    0.1836    0.1823    0.1440      0.1921    0.1913    0.1420
                       200    0.1641    0.1634    0.1258      0.1748    0.1742    0.1269
                       400    0.1438    0.1428    0.1061      0.1508    0.1503    0.1054
                       600    0.1308    0.1297    0.0931      0.1372    0.1364    0.0933

project vs. all        40     0.1827    0.1793    0.1306      0.1831    0.1805    0.1333
                       80     0.1426    0.1416    0.0978      0.1378    0.1367    0.0982
                       120    0.1213    0.1209    0.0834      0.1169    0.1163    0.0834
                       200    0.1053    0.1043    0.0709      0.1007    0.0999    0.0706
                       400    0.0785    0.0766    0.0537      0.0802    0.0790    0.0574
                       600    0.0702    0.0680    0.0449      0.0719    0.0708    0.0504

student vs. all        40     0.2417    0.2411    0.1834      0.2100    0.2086    0.1740
                       80     0.1900    0.1899    0.1454      0.1681    0.1672    0.1358
                       120    0.1696    0.1693    0.1291      0.1531    0.1523    0.1204
                       200    0.1539    0.1539    0.1134      0.1349    0.1344    0.1043
                       400    0.1310    0.1308    0.0935      0.1147    0.1144    0.0874
                       600    0.1173    0.1169    0.0818      0.1063    0.1059    0.0802

Table  1:  Experimental  results  on  the  WebKB  corpus,  using  SVMs  for  linear,  Gaussian,  and 
multinomial  diffusion  kernels.  The  left  columns  use  tf  representation  and  the  right  columns 
use  tfidf  representation.  The  error  rates  shown  are  averages  obtained  using  20-fold  cross 
validation.  The  best  performance  for  each  training  set  size  L  is  shown  in  boldface.  All 
differences  are  statistically  significant  according  to  the  paired  t  test  at  the  0.05  level. 


negative  class  to  be  the  most  frequent  remaining  class  (student  for  WebKB  and  earn  for 
Reuters).  In  both  cases,  the  size  of  the  training  set  is  varied  while  keeping  the  proportion 
of  positive  and  negative  documents  constant  in  both  the  training  and  test  set. 
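
The two task types and the class-balanced subsampling can be sketched as follows. The helper names and the rounding rule for the number of positive examples are assumptions made for illustration; the paper does not specify how the proportions were enforced.

```python
import numpy as np

def one_vs_all(labels, positive):
    """First task type: the designated class is +1, every other class is -1."""
    return np.where(np.asarray(labels) == positive, 1, -1)

def one_vs_one(labels, positive, negative):
    """Second task type: keep only two classes; returns (kept indices, +/-1 labels)."""
    labels = np.asarray(labels)
    keep = np.flatnonzero((labels == positive) | (labels == negative))
    return keep, np.where(labels[keep] == positive, 1, -1)

def stratified_train_indices(y, train_size, rng):
    """Draw train_size indices while preserving the positive/negative proportion."""
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    n_pos = int(round(train_size * len(pos) / len(y)))  # assumed rounding rule
    n_neg = train_size - n_pos
    return np.concatenate([
        rng.choice(pos, n_pos, replace=False),
        rng.choice(neg, n_neg, replace=False),
    ])

# Toy label vector standing in for a corpus (hypothetical class sizes).
labels = ["course"] * 30 + ["student"] * 50 + ["faculty"] * 20
y = one_vs_all(labels, "course")
train_idx = stratified_train_indices(y, 40, np.random.default_rng(0))
```

The remaining indices then form a test set with approximately the same class proportion, since the training subset was drawn in proportion from each class.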

Figure  6  and  Figure  8  show  representative  results  for  the  second  type  of  classification 
task,  where  the  goal  is  to  discriminate  between  two  specific  classes.  In  the  case  of  the 
WebKB  data  the  results  are  shown  for  course  vs.  student.  In  the  case  of  the  Reuters  data 
the  results  are  shown  for  moneyFx  vs.  earn  and  grain  vs.  earn.  Again,  the  results  for  the 
other  classes  are  qualitatively  similar;  the  numerical  results  are  summarized  in  Tables  2 
and  4. 

                              tf Representation               tfidf Representation
Task                   L      Linear    Gaussian  Diffusion   Linear    Gaussian  Diffusion
course vs. student     40     0.0808    0.0802    0.0391      0.0580    0.0572    0.0363
                       80     0.0505    0.0504    0.0266      0.0409    0.0406    0.0251
                       120    0.0419    0.0409    0.0231      0.0361    0.0359    0.0225
                       200    0.0333    0.0328    0.0184      0.0310    0.0308    0.0201
                       400    0.0263    0.0259    0.0135      0.0234    0.0232    0.0159
                       600    0.0228    0.0221    0.0117      0.0207    0.0202    0.0141

faculty vs. student    40     0.2106    0.2102    0.1624      0.2053    0.2026    0.1663
                       80     0.1766    0.1764    0.1357      0.1729    0.1718    0.1335
                       120    0.1624    0.1618    0.1198      0.1578    0.1573    0.1187
                       200    0.1405    0.1405    0.0992      0.1420    0.1418    0.1026
                       400    0.1160    0.1158    0.0759      0.1166    0.1165    0.0781
                       600    0.1050    0.1046    0.0656      0.1050    0.1048    0.0692

project vs. student    40     0.1434    0.1430    0.0908      0.1304    0.1279    0.0863
                       80     0.1139    0.1133    0.0725      0.0982    0.0970    0.0634
                       120    0.0958    0.0957    0.0613      0.0870    0.0866    0.0559
                       200    0.0781    0.0775    0.0514      0.0729    0.0722    0.0472
                       400    0.0590    0.0579    0.0405      0.0629    0.0622    0.0397
                       600    0.0515    0.0500    0.0325      0.0551    0.0539    0.0358

Table  2:  Experimental  results  on  the  WebKB  corpus,  using  SVMs  for  linear,  Gaussian,  and 
multinomial  diffusion  kernels.  The  left  columns  use  tf  representation  and  the  right  columns 
use  tfidf  representation.  The  error  rates  shown  are  averages  obtained  using  20-fold  cross 
validation.  The  best  performance  for  each  training  set  size  L  is  shown  in  boldface.  All 
differences  are  statistically  significant  according  to  the  paired  t  test  at  the  0.05  level. 


In these figures, the leftmost plots show the performance of tf features while the rightmost
plots show the performance of tfidf features. As mentioned above, in the case of the
diffusion kernel we use $L_1$ normalization to give a valid embedding into the probability
simplex, while for the linear and Gaussian kernels we use $L_2$ normalization, which works
better empirically than $L_1$ for these kernels. The curves show the test set error rates averaged
over 20 iterations of cross validation as a function of the training set size. The error
bars represent one standard deviation. For both the Gaussian and diffusion kernels, we test
scale parameters ($\sqrt{2}\sigma$ for the Gaussian kernel and $2t^{1/2}$ for the diffusion kernel) in the set
$\{0.5, 1, 2, 3, 4, 5, 7, 10\}$. The results reported are for the best parameter value in that range.
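
The scale-parameter grid can be related to the kernel parameters as sketched below. This assumes the reading that a scale value $s$ equals $\sqrt{2}\sigma$ for the Gaussian kernel and $2t^{1/2}$ for the diffusion kernel; the helper names and the `error_fn` callback are hypothetical.

```python
import math

SCALES = [0.5, 1, 2, 3, 4, 5, 7, 10]

def sigma_from_scale(s):
    """Gaussian kernel: s = sqrt(2) * sigma, so 2 * sigma**2 == s**2."""
    return s / math.sqrt(2)

def t_from_scale(s):
    """Diffusion kernel: s = 2 * sqrt(t), so 4 * t == s**2."""
    return (s / 2) ** 2

def best_scale(error_fn, scales=SCALES):
    """Return the scale with the lowest error; error_fn maps a scale to an error rate."""
    return min(scales, key=error_fn)
```

Under this reading, both kernels take the form $\exp(-(\text{distance}/s)^2)$ up to a constant factor, with Euclidean distance in one case and geodesic distance $2\arccos(\sqrt{\theta}\cdot\sqrt{\theta'})$ in the other, which makes a shared grid of scale values natural.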

We also performed experiments with the popular Mod-Apte train and test split for the
top 10 categories of the Reuters collection. For this split, the training set has about 7000
documents and is highly biased towards negative documents. We report in Table 5 the
test set error rates for the tf representation. For the tfidf representation, the difference
between the different kernels is not statistically significant for this amount of training and
test data. The provided training set is more than enough to achieve outstanding performance

                              tf Representation               tfidf Representation
Task                   L      Linear    Gaussian  Diffusion   Linear    Gaussian  Diffusion
earn vs. all           80     0.1107    0.1106    0.0971      0.0823    0.0827    0.0762
                       120    0.0988    0.0990    0.0853      0.0710    0.0715    0.0646
                       200    0.0808    0.0810    0.0660      0.0535    0.0538    0.0480
                       400    0.0578    0.0578    0.0456      0.0404    0.0408    0.0358
                       600    0.0465    0.0464    0.0367      0.0323    0.0325    0.0290

acq vs. all            80     0.1126    0.1125    0.0846      0.0788    0.0785    0.0667
                       120    0.0886    0.0885    0.0697      0.0632    0.0632    0.0534
                       200    0.0678    0.0676    0.0562      0.0499    0.0500    0.0441
                       400    0.0506    0.0503    0.0419      0.0370    0.0369    0.0335
                       600    0.0439    0.0435    0.0363      0.0318    0.0316    0.0301

moneyFx vs. all        80     0.1201    0.1198    0.0758      0.0676    0.0669    0.0647*
                       120    0.0986    0.0979    0.0639      0.0557    0.0545    0.0531*
                       200    0.0814    0.0811    0.0544      0.0485    0.0472    0.0438
                       400    0.0578    0.0567    0.0416      0.0427    0.0418    0.0392
                       600    0.0478    0.0467    0.0375      0.0391    0.0385    0.0369*

grain vs. all          80     0.1443    0.1440    0.0925      0.0536    0.0518*   0.0595
                       120    0.1101    0.1097    0.0717      0.0476    0.0467*   0.0494
                       200    0.0793    0.0786    0.0576      0.0430    0.0420*   0.0440
                       400    0.0590    0.0573    0.0450      0.0349    0.0340*   0.0365
                       600    0.0517    0.0497    0.0401      0.0290    0.0284*   0.0306

crude vs. all          80     0.1396    0.1396    0.0865      0.0502    0.0485*   0.0524
                       120    0.0961    0.0953    0.0542      0.0446    0.0425*   0.0428
                       200    0.0624    0.0613    0.0414      0.0388    0.0373    0.0345*
                       400    0.0409    0.0403    0.0325      0.0345    0.0337    0.0297
                       600    0.0379    0.0362    0.0299      0.0292    0.0284    0.0264*

Table  3:  Experimental  results  on  the  Reuters  corpus,  using  SVMs  for  linear,  Gaussian, 
and  multinomial  diffusion  kernels.  The  left  columns  use  tf  representation  and  the  right 
columns  use  tfidf  representation.  The  error  rates  shown  are  averages  obtained  using  20-fold 
cross  validation.  The  best  performance  for  each  training  set  size  L  is  shown  in  boldface. 
An  asterisk  (*)  indicates  that  the  difference  is  not  statistically  significant  according  to  the 
paired  t  test  at  the  0.05  level. 


with  all  kernels  used,  and  the  absence  of  cross  validation  data  makes  the  results  too  noisy 
for  interpretation. 

Our results are consistent with previous experiments in text classification using SVMs,
which have observed that the linear and Gaussian kernels result in very similar performance
(Joachims et al., 2001). However, for the tf representation the multinomial diffusion kernel
achieves significantly lower error rates than the linear and Gaussian kernels.
For the tfidf representation, the diffusion kernel consistently

                              tf Representation               tfidf Representation
Task                   L      Linear    Gaussian  Diffusion   Linear    Gaussian  Diffusion
acq vs. earn           40     0.1043    0.1043    0.1021*     0.0829    0.0831    0.0814*
                       80     0.0902    0.0902    0.0856*     0.0764    0.0767    0.0730*
                       120    0.0795    0.0796    0.0715      0.0626    0.0628    0.0562
                       200    0.0599    0.0599    0.0497      0.0509    0.0511    0.0431
                       400    0.0417    0.0417    0.0340      0.0336    0.0337    0.0294

moneyFx vs. earn       40     0.0759    0.0758    0.0474      0.0451    0.0451    0.0372*
                       80     0.0442    0.0443    0.0238      0.0246    0.0246    0.0177
                       120    0.0313    0.0311    0.0160      0.0179    0.0179    0.0120
                       200    0.0244    0.0237    0.0118      0.0113    0.0113    0.0080
                       400    0.0144    0.0142    0.0079      0.0080    0.0079    0.0062

grain vs. earn         40     0.0969    0.0970    0.0543      0.0365    0.0366    0.0336*
                       80     0.0593    0.0594    0.0275      0.0231    0.0231    0.0201*
                       120    0.0379    0.0377    0.0158      0.0147    0.0147    0.0114*
                       200    0.0221    0.0219    0.0091      0.0082    0.0081    0.0069*
                       400    0.0107    0.0105    0.0060      0.0037    0.0037    0.0037*

crude vs. earn         40     0.1108    0.1107    0.0950      0.0583*   0.0586    0.0590
                       80     0.0759    0.0757    0.0552      0.0376    0.0377    0.0366*
                       120    0.0608    0.0607    0.0415      0.0276    0.0276*   0.0284
                       200    0.0410    0.0411    0.0267      0.0218*   0.0218    0.0225
                       400    0.0261    0.0257    0.0194      0.0176    0.0171*   0.0181

Table  4:  Experimental  results  on  the  Reuters  corpus,  using  SVMs  for  linear,  Gaussian, 
and  multinomial  diffusion  kernels.  The  left  columns  use  tf  representation  and  the  right 
columns  use  tfidf  representation.  The  error  rates  shown  are  averages  obtained  using  20-fold 
cross  validation.  The  best  performance  for  each  training  set  size  L  is  shown  in  boldface. 
An  asterisk  (*)  indicates  that  the  difference  is  not  statistically  significant  according  to  the 
paired  t  test  at  the  0.05  level. 


outperforms the other kernels for the WebKB data and usually outperforms the linear and
Gaussian  kernels  for  the  Reuters  data.  The  Reuters  data  is  a  much  larger  collection  than 
WebKB,  and  the  document  frequency  statistics,  which  are  the  basis  for  the  inverse  document 
frequency  weighting  in  the  tfidf  representation,  are  evidently  much  more  effective  on  this 
collection.  It  is  notable,  however,  that  the  multinomial  information  diffusion  kernel  achieves 
at  least  as  high  an  accuracy  without  the  use  of  any  heuristic  term  weighting  scheme.  These 
results  offer  evidence  that  the  use  of  multinomial  geometry  is  both  theoretically  motivated 
and  practically  effective  for  document  classification. 

Category    Linear    RBF       Diffusion
earn        0.01159   0.01159   0.01026
acq         0.01854   0.01854   0.01788
money-fx    0.02418   0.02451   0.02219
grain       0.01391   0.01391   0.01060
crude       0.01755   0.01656   0.01490
trade       0.01722   0.01656   0.01689
interest    0.01854   0.01854   0.01689
ship        0.01324   0.01324   0.01225
wheat       0.00894   0.00794   0.00629
corn        0.00794   0.00794   0.00563

Table  5:  Test  set  error  rates  for  the  Reuters  top  10  classes  using  tf  features.  The  train  and 
test  sets  were  created  using  the  Mod-Apte  split. 

6  Discussion  and  Conclusion 

This  paper  has  introduced  a  family  of  kernels  that  is  intimately  based  on  the  geometry  of 
the  Riemannian  manifold  associated  with  a  statistical  family  through  the  Fisher  information 
metric.  The  metric  is  canonical  in  the  sense  that  it  is  uniquely  determined  by  requirements 
of  invariance  (Cencov,  1982),  and  moreover,  the  choice  of  the  heat  kernel  is  natural  because 
it  effectively  encodes  a  great  deal  of  geometric  information  about  the  manifold.  While  the 
geometric  perspective  in  statistics  has  most  often  led  to  reformulations  of  results  that  can 
be  viewed  more  traditionally,  the  kernel  methods  developed  here  clearly  depend  crucially 
on  the  geometry  of  statistical  families. 

The  main  application  of  these  ideas  has  been  to  develop  the  multinomial  diffusion  kernel. 
A  related  use  of  spherical  geometry  for  the  multinomial  has  been  developed  by  Gous  (1998). 
Our  experimental  results  indicate  that  the  resulting  diffusion  kernel  is  indeed  effective  for 
text classification using support vector machine classifiers, and can lead to significant
improvements in accuracy compared with the use of linear or Gaussian kernels, which have
been the standard for this application. The results of Section 5 are notable since accuracies
better than or comparable to those obtained using heuristic weighting schemes such as tfidf are
achieved directly through the geometric approach. In part, this can be attributed to the
role  of  the  Fisher  information  metric;  because  of  the  square  root  in  the  embedding  into  the 
sphere,  terms  that  are  infrequent  in  a  document  are  effectively  up-weighted,  and  such  terms 
are  typically  rare  in  the  document  collection  overall.  The  primary  degree  of  freedom  in  the 
use  of  information  diffusion  kernels  lies  in  the  specification  of  the  mapping  of  data  to  model 
parameters.  For  the  multinomial,  we  have  used  the  maximum  likelihood  mapping.  The  use 
of  other  model  families  and  mappings  remains  an  interesting  direction  to  explore. 
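
The up-weighting effect of the square root mentioned above can be seen in a small numerical example. The probability vector below is a made-up toy value, chosen only to contrast a frequent term with a rare one.

```python
import numpy as np

# Toy document model: one frequent term, one moderate term, one rare term.
theta = np.array([0.90, 0.09, 0.01])

# Square-root map from the simplex to the sphere; the image has unit
# Euclidean norm because the entries of theta sum to one.
sphere_point = np.sqrt(theta)

# Weight of the rare term relative to the frequent term, before and after.
ratio_simplex = theta[2] / theta[0]
ratio_sphere = sphere_point[2] / sphere_point[0]
```

Here the rare term's weight relative to the frequent term rises from about 0.011 on the simplex to about 0.105 on the sphere, since the ratio is replaced by its square root.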

While kernel methods generally are “model free,” and do not make distributional assumptions
about the data that the learning algorithm is applied to, statistical models offer
many advantages, and thus it is attractive to explore methods that combine data models
and purely discriminative methods. Our approach combines parametric statistical modeling
with non-parametric discriminative learning, guided by geometric considerations. In these
aspects it is related to the methods proposed by Jaakkola and Haussler (1998). However, the
kernels proposed in the current paper differ significantly from the Fisher kernel of Jaakkola
and Haussler (1998). In particular, the latter is based on the score $\nabla_\theta \log p(X \mid \hat\theta)$ at a single
point $\hat\theta$ in parameter space. In the case of an exponential family model it is given by a
covariance $K_F(x, x') = \sum_i (x_i - E_{\hat\theta}[X_i])(x'_i - E_{\hat\theta}[X_i])$; this covariance is then heuristically
exponentiated. In contrast, information diffusion kernels are based on the full geometry of
the statistical family, and yet are also invariant under reparameterization of the family. In
other  conceptually  related  work,  Belkin  and  Niyogi  (2003)  suggest  measuring  distances  on 
the  data  graph  to  approximate  the  underlying  manifold  structure  of  the  data.  In  this  case 
the  underlying  geometry  is  inherited  from  the  embedding  Euclidean  space  rather  than  the 
Fisher  geometry. 

While  information  diffusion  kernels  are  very  general,  they  will  be  difficult  to  compute  in 
many  cases — explicit  formulas  such  as  equations  (52-53)  for  hyperbolic  space  are  rare.  To 
approximate  an  information  diffusion  kernel  it  may  be  attractive  to  use  the  parametrices 
and  geodesic  distance  between  points,  as  we  have  done  for  the  multinomial.  In  cases  where 
the  distance  itself  is  difficult  to  compute  exactly,  a  compromise  may  be  to  approximate  the 
distance between nearby points in terms of the Kullback-Leibler divergence, using the relation
with the Fisher information that was noted in Section 3.1. In effect, this approximation
is already incorporated into the kernels recently proposed by Moreno et al. (2004) for multimedia
applications, which have the form $K(\theta, \theta') \propto \exp(-\alpha D(\theta, \theta')) \approx \exp(-2\alpha d^2(\theta, \theta'))$,
and so can be viewed in terms of the leading order approximation to the heat kernel. The
results of Moreno et al. (2004) suggest that diffusion kernels may be attractive not
only for multinomial geometry, but also for much more complex statistical families.
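
The local agreement between the Kullback-Leibler divergence and the squared geodesic distance can be checked numerically for the multinomial. The sketch below is only an illustration: it uses the one-sided divergence and the distance $2\arccos(\sqrt{\theta}\cdot\sqrt{\theta'})$, and the constant relating the two quantities depends on such conventions, so it verifies only that their ratio stabilizes as the points approach each other. The function names and the test points are assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """One-sided Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def geodesic_distance(p, q):
    """Fisher geodesic distance on the multinomial simplex, 2 * arccos(sum sqrt(p_i q_i))."""
    inner = np.clip(np.sum(np.sqrt(np.asarray(p) * np.asarray(q))), 0.0, 1.0)
    return float(2.0 * np.arccos(inner))

p = np.array([0.50, 0.30, 0.20])
direction = np.array([1.0, -1.0, 0.0])  # perturbation that keeps the sum equal to one

def ratio(eps):
    """KL divergence divided by squared geodesic distance for a perturbation of size eps."""
    q = p + eps * direction
    return kl_divergence(p, q) / geodesic_distance(p, q) ** 2

ratio_coarse = ratio(0.01)
ratio_fine = ratio(0.005)
```

The two ratios agree closely, which is the behaviour one expects when the divergence is, to leading order, a constant multiple of the squared distance.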